Incidents
Full history of incidents.
September 2022
Some components related to the ingestion of metrics and access logs are currently overloaded. We are working on it.
**16:30 UTC**: The incident has been resolved.
Some Elasticsearch add-ons are currently reporting a license expiration. The license is set to expire on 2022-09-30 23:59:59 UTC. Our team is currently working on it and the license of affected add-ons will be updated prior to the expiration date. We will update this incident once all add-ons are updated.
No service degradation is to be expected from this warning.
Please reach out to our support team should you have any questions regarding this matter.
EDIT 2022-09-29 17:30 UTC: A first license update has been applied. Another update will be applied in the following days to complete the process.
EDIT 2022-10-12 16:55 UTC: All licenses have been updated with a valid platinum license. The incident is over.
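If you want to verify the license applied to your own add-on, here is a minimal sketch querying the standard Elasticsearch `GET /_license` endpoint. The add-on URL and credentials are placeholders; note that clusters older than 7.x expose the same data under `/_xpack/license`.

```python
# Minimal sketch: check an Elasticsearch license via the REST API.
# ES_URL is a placeholder; use your own add-on's host and credentials.
import datetime

import requests

ES_URL = "https://user:password@my-es-addon.example.com"  # placeholder

resp = requests.get(f"{ES_URL}/_license", timeout=10)
resp.raise_for_status()
lic = resp.json()["license"]

# expiry_date_in_millis is absent on basic (non-expiring) licenses.
expiry = datetime.datetime.fromtimestamp(
    lic["expiry_date_in_millis"] / 1000, tz=datetime.timezone.utc
)
print(f"type:    {lic['type']}")    # e.g. "platinum"
print(f"status:  {lic['status']}")  # "active" or "expired"
print(f"expires: {expiry.isoformat()}")
```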
The service encountered an outage in the ingestion path, which held back access logs at the messaging layer. Although the error was caught by our monitoring, it was unfortunately classified as low priority and therefore went unnoticed during part of the weekend, leading to a loss of access logs once the retention period elapsed. We have identified and resolved the root cause, which should prevent similar incidents. We have also corrected the severity level of this alert in our monitoring infrastructure.
26/09/2022 12:00 UTC: End of incident
Maintenance will occur on the Grafana instances used to plot Clever Cloud metrics on 09/27/2022 at 2:30 p.m. (CEST). We will update our instances to the latest major release of Grafana: Grafana 9. You can check the Grafana release post to learn what this change will bring you: https://grafana.com/docs/grafana/latest/whatsnew/whats-new-in-v9-0/.
Application creation might fail for Jenkins runners with an HTTP 500 Internal Server Error. A fix for the underlying issue will be deployed soon.
EDIT 16:46 UTC: The fix has been deployed. We are monitoring the situation. This issue also impacted Heptapod runner creation.
EDIT 17:28 UTC: The issue has been fixed; runner creation is now working correctly. Sorry for the trouble.
We are currently seeing network loss between our Paris infrastructure and our zones hosted at OVH (Roubaix, Montreal, ...). We are investigating the issue.
EDIT 11:50 UTC: Initial investigation shows that this is not only a network issue between our Paris infrastructure and the OVH network; it seems to impact other network links as well. We will reach out to OVH to try to learn more.
EDIT 11:51 UTC: The incident has been renamed from "Network issues between Paris and OVH zones" to "Network issues on OVH zones"
EDIT 11:58 UTC: We have been seeing improvements for a few minutes now. From our point of view, connectivity has been restored. We are still waiting for more information.
EDIT 12:09 UTC: We have not seen any new disruption so far. We consider this incident closed while we wait for a more detailed incident report from OVH.
EDIT 12:59 UTC: OVH status: https://network.status-ovhcloud.com/incidents/5mldyhd6v99c
Some HBase datanodes have lost their regions. All datanodes are now OK.
A hypervisor needs to be rebooted on our Montreal zone. The reboot will happen at 08:00 UTC on Friday, 16 September. Add-ons that support automatic migration will be migrated automatically starting at 07:30 UTC. You can also perform the migration yourself at a time that suits you better, before the given deadline.
This will also impact some FSBucket add-ons, whose reads and writes will be unavailable during the maintenance. Applications will be redeployed automatically once the maintenance is over to make sure they correctly reconnect to the FSBucket server.
The maintenance is expected to last 15 minutes.
Impacted users will shortly receive an email listing the impacted add-ons.
**Edit 08:05 UTC**: Waiting for the last migration to end.
**Edit 08:25 UTC**: The last migration has ended; the maintenance is beginning.
**Edit 08:35 UTC**: The server has rebooted successfully.
**Edit 08:55 UTC**: Everything is up and running normally.
A file-system bucket server in the Paris data center was down for 6 minutes, from 12:24 UTC to 12:30 UTC.
We have fixed the issue and are monitoring the service.
Some tokens used by our infrastructure have not been renewed. As a result, some VMs cannot push their latest metrics. We are working on it.
EDIT 10:38 UTC: All expired tokens have been regenerated and updated. Sorry for the inconvenience.
Our distributed database responsible for metrics and access-log storage is not ingesting fast enough. As a result, you may experience some lag during queries. We are investigating.
EDIT 03/09/2022 12:10 UTC: Ingestion is finally catching up on the lag; we will keep you posted.
EDIT 03/09/2022 16:10 UTC: Ingestion has fully caught up; the lag is gone.
August 2022
A hypervisor needs to be rebooted on our Paris zone. The reboot will happen at 22:00 UTC on Monday, September 5th. Add-ons that support automatic migration will be migrated automatically starting at 21:00 UTC. You can also perform the migration yourself at a time that suits you better, before the given deadline.
This will also impact some FSBucket add-ons, whose reads and writes will be unavailable during the maintenance. Applications will be redeployed automatically once the maintenance is over to make sure they correctly reconnect to the FSBucket server.
The maintenance is expected to last 15 minutes.
Impacted users will shortly receive an email listing the impacted add-ons.
EDIT 2022-09-05 21:10 UTC: Add-on migrations are starting.
EDIT 2022-09-05 21:40 UTC: Add-ons have been migrated. The hypervisor reboot will happen in twenty minutes.
EDIT 2022-09-05 22:00 UTC: Hypervisor is rebooting
EDIT 2022-09-05 22:28 UTC: The hypervisor rebooted in 4 minutes, and the FSBucket server came back one minute later, with most clients reconnecting. We restarted all affected applications to make sure everyone properly reconnects.
One hypervisor went down in MTL2. We are trying to reboot it.
It affects:
- 1 load balancer
- 1 Redis add-on
- 1 MySQL add-on
- the free PostgreSQL databases on MTL
Update 16:40: After investigating, we decided to redirect the IP of the load balancer to the second LB. A ticket is open at OVHCloud to investigate what seems to be a hardware issue.
Update 17:56: The OVHCloud team physically checked the server: the RAID card was broken. They replaced it and restarted the server.
Update 18:05: All VMs on the hypervisor are up and running again.
A hypervisor on the Paris zone is currently unreachable. We are looking into it.
EDIT 17:38 UTC: Hypervisor has been rebooted. Services are being restarted.
EDIT 18:08 UTC: Services have all been restarted. We are still looking into why the hypervisor went down and will continue to monitor the situation.
EDIT 18:27 UTC: Initial investigation shows that a KVM kernel bug was encountered, leading to a kernel crash. We will investigate further to see if this can be mitigated by an update. The incident is now over.
We have been seeing network loss towards the New York zone from multiple locations since 06:05 UTC. We are looking into the issue. Applications and add-ons may be unreachable from various places, and multiple services on the zone (deployments, logs) will not be available.
EDIT 07:04 UTC: We are seeing network improvements in reaching the zone. It is currently operational, but we are still waiting for confirmation from our provider. From our point of view, traffic towards the zone was being dropped when it reached the Level3 network transit. Our network provider seems to have switched to another transit provider, allowing us to reach the zone again.
EDIT 12:18 UTC: The network problem is fully resolved. We are still waiting for an incident report from the datacenter's network operator. We will share it once available.
EDIT 2022-08-26 14:27 UTC: Here is the report from our provider: It has been identified that the incident is due to a bug found in our device at DRT1. As an initial resolution, our team rebooted the device. Consequently, all alarms cleared and all services were restored after executing the said activity. As of the moment, we can confirm that the link has remained clean and error-free since the service went up.
We are investigating an unresponsive hypervisor on the Paris zone. An FSBucket server is hosted on this hypervisor, so some PHP applications may be impacted, as well as add-ons hosted on it.
EDIT 18:02 UTC+2: Hypervisor is rebooting
EDIT 18:04 UTC+2: Hypervisor is up again. Services are currently restarting.
EDIT 18:25 UTC+2: All hypervisor services have been back up for a few minutes. Add-ons should now be reachable. Applications of owners using the FSBucket server hosted on this hypervisor will be redeployed. Since there is a huge number of applications, you can redeploy them yourself directly if needed. We will continue to monitor the situation.
EDIT 19:10 UTC+2: The situation seems to be back to normal. We will investigate further why this hypervisor became unresponsive. If you still have any issues, please contact our support team.
We lost a server hosting FS buckets. The server is back up and running.
An FS Bucket machine was unreachable from 2:59 PM to 3:01 PM today. It has been rebooted and is now available.
A reverse proxy for add-ons was unreachable from 3:53 PM to 3:59 PM. It has been rebooted.
July 2022
We are experiencing deployment slowdowns. We have identified the root cause and are working on a solution.
EDIT 00:54: The issue has been resolved; deployments are now working normally.