Incidents
Full history of incidents.
November 2024
We experienced a small fiber cut between two AZs of the Paris region between 10:06 and 10:07 UTC. While the impact should have been minimal, that does not appear to have been the case: some services may have encountered timeouts or errors for a few minutes after the cut.
We are investigating the issue.
EDIT 11:50 UTC: We believe we have found the underlying issue. Some network routes were still stale and kept trying to use the dead links for several seconds. This led to the timeouts and various service degradations. We will update our configurations to avoid this issue in the future.
We are currently seeing trouble reaching some add-ons via their domain names. We are investigating.
This is mainly impacting:
- Elasticsearch and Jenkins add-ons on several of our regions
[18:30 CET] We found the root cause and are currently deploying a patch. We continue to monitor the situation.
October 2024
We are currently observing increased packet loss and network latency in OVH-hosted regions. We are investigating the issue. Access to our services in the following regions may be impacted, as well as normal platform operations in those regions (deployments, logs, metrics, ...): RBX, RBXHDS, GRAHDS, WSW, MTL.
EDIT 13:43 UTC: We have been seeing improvements since 13:33 UTC, but some ISPs are still having trouble accessing OVH-based services.
EDIT 13:54 UTC: The incident has been posted on OVH status: https://network.status-ovhcloud.com/incidents/qgb1ynp8x0c4. We are following its updates. Currently it seems that all access is restored, but we won't close this incident until OVH acknowledges it is over on their side.
EDIT 15:50 UTC: The incident has been closed on OVH side.
The Pydio view in the Console shows an error depending on your browser (issue with Chrome-based browsers and Safari). We are investigating the issue.
Current workaround: it is known to work on Firefox.
The PGStudio view in the Console does not work depending on your browser (issue with Chrome-based browsers and Safari). We are investigating the issue.
Current workaround: it is known to work on Firefox.
[20:30 UTC]: You can now log in with Chrome-based browsers and Safari. To work properly, you need to enable third-party cookies; see the docs for Firefox, Chrome and Safari.
At 09:30 UTC, we received an alert from the monitoring about deployments not going through. We checked and saw that the orchestrator failed silently in a way that wasn’t detected by the service controller. We restarted the orchestrator and the deployments started to be handled again. 09:33 UTC: end of alert.
We will find a way to detect this kind of failure and handle it before the deployments get blocked.
The phpMyAdmin view in the Console shows an error depending on your browser (issue with Chrome-based browsers and Safari). We are investigating the issue.
Current workaround: it is known to work on Firefox.
Users need to enable third-party cookies so it works properly again.
Following yesterday's incident (https://www.clevercloudstatus.com/incident/911), we took action to address the root cause.
When we perform certain actions, the ZooKeeper cluster becomes unstable and fails. The access logs, logs, and deployment stacks are affected, as well as all services interacting with the Pulsar cluster.
16:20 UTC: The Pulsar cluster is up and running, and the deployment and logs stacks are running as well. We are restarting the access logs stack.
16:30 UTC: The access logs stack is up and running.
Monitoring reports that Pulsar is in an unhealthy state; we are investigating.
16:38 UTC: there seems to be an inconsistency in the underlying BookKeeper cluster. We are looking into it.
16:40 UTC: we are now looking into the ZooKeeper service, which seems to be failing.
17:30 UTC: we have fixed the ZooKeeper issue and have begun the recovery process of the BookKeeper cluster, then Pulsar.
18:10 UTC: we are gradually reopening access to the Pulsar cluster.
18:45 UTC: we have reopened access to the Pulsar cluster for half of our hypervisors.
19:15 UTC: the Pulsar cluster is running and available for everyone. We are running the platform recovery process to ensure that every application is up and running as well.
21:30 UTC: we have finished redeploying applications. We are investigating the access logs stack, which got offloader errors on the Pulsar side.
22:10 UTC: we have finished restarting the access logs stack.
For security reasons, we will update the kernel of 4 hypervisors in the Paris (PAR) region, more precisely in the PAR6 datacenter. Services (in particular databases) hosted on those hypervisors will be impacted: they will be unavailable for 5 to 10 minutes. Impacted hypervisors are:
On Wednesday 20 November:
- hv-par6-012
- hv-par6-020
On Thursday 21 November:
- hv-par6-008
- hv-par6-011
Affected clients are being contacted directly and individually by email with the list of impacted services and options to avoid any impact. The maintenance is planned as 2 operations of 2 hypervisors each, during the week of 18 to 22 November 2024, between 22:00 and 24:00 UTC+1.
EDIT 2024-11-20 22:30 UTC: Both hypervisors were rebooted. All services are available again.
EDIT 2024-11-21 22:00 UTC: Both hypervisors were rebooted. All services are available again. This maintenance is now over.
A hypervisor located in the PAR zone seems unreachable; we are investigating.
EDIT Thu Oct 10 18:33:25 2024 UTC: The hypervisor is back online, along with all related services.
We are investigating network instabilities that occurred between 15:21 UTC+2 and 15:23 UTC+2 in the Paris region. During that time, you may have encountered timeouts when reaching services hosted in the region. Service is currently operational.
EDIT 20:00 UTC+2: The instabilities were due to a sudden increase in traffic towards the region.
The certificate of cleverapps.io was not properly renewed at 12:06 UTC. A manual regeneration of the certificate is on the way.
EDIT: The certificate was renewed at 12:33 UTC; it has been applied and propagated to all load balancers.
A scheduled network maintenance will be carried out in the Paris region on Wednesday, October 2, 2024. This upgrade will affect non-production links, and no impact on production systems is expected.
Start Date & Time: 2024-10-02 20:00 UTC
End Date & Time: 2024-10-02 21:00 UTC
We will provide regular updates throughout the maintenance period.
EDIT 20:38 UTC: The maintenance is now starting.
EDIT 21:45 UTC: The maintenance is still ongoing. Most of the operations are over; verifications are currently taking place.
EDIT 22:20 UTC: The maintenance is now over. No impact detected.
We are experiencing issues with the deployment pipeline.
EDIT 12:43 UTC: the system has returned to normal operation. Our team is continuing to investigate the root cause to ensure stability moving forward. Further updates will be provided as necessary.
EDIT 13:12 UTC: fixed.
September 2024
Following a maintenance operation to reduce load on the Pulsar cluster, the cluster has an issue with some configurations. We are investigating the cause.
We have detected some latency and a few instabilities when connecting to our platform; we are investigating.
EDIT 17:40 - The root cause has been identified, and the network is now stabilized. We are closely monitoring the platform to make sure this incident is closed.
We detected an issue with log reads.
EDIT 13:00 UTC: identified and patched. We are currently deploying the fix.
EDIT 13:15 UTC: fixed.
We are experiencing a Pulsar outage, which impacts logs, access logs, and other components of the platform. The preliminary root cause appears to be a ZooKeeper problem. We are working on it.
EDIT Fri Sep 20 18:16:00 2024 UTC: Deployments have been disabled. We are still investigating the ZooKeeper outage, which is causing the Pulsar outage.
EDIT Fri Sep 20 19:57:09 2024 UTC: The ZooKeeper quorum is back online, and therefore Pulsar as well. Deployments have been re-enabled; we are watching the situation.
EDIT Fri Sep 20 22:09:40 2024 UTC: The Pulsar cluster is still unstable; deployments have been disabled again.
EDIT Fri Sep 20 23:36:10 2024 UTC: The deployments queue is back. We are ramping logs data usage back up gradually to avoid overloading Pulsar.
EDIT Sat Sep 21 00:49:10 2024 UTC: The Pulsar cluster is now stable. Applications should now have their logs available in the console / CLI as well as in the drains. The access logs backlog is currently catching up. We continue to monitor the situation.
We are experiencing a Pulsar outage, which impacts logs, access logs, and other components of the platform. The preliminary root cause appears to be a ZooKeeper problem. We are working on it.
EDIT Thu Sep 19 20:49:09 2024 UTC: since 20:20, the ZooKeeper quorum is up, and all services connected to Pulsar are now back online.
EDIT Fri Sep 20 07:43:00 2024 UTC: we are still impacted by the ZooKeeper outage and are investigating the issue. The logs and access logs stacks are currently unavailable.
EDIT Fri Sep 20 08:04:00 2024 UTC: we have found the issue: a Pulsar broker was indefinitely retrying metadata writes to ZooKeeper. We have restarted the affected broker and are watching; the situation is returning to normal.
EDIT Fri Sep 20 08:20:00 2024 UTC: we are still watching the metrics from the Pulsar cluster; the situation is returning to normal. We are recovering from the lag on access logs ingestion; the current ETA is around 12:30 UTC.
EDIT Fri Sep 20 13:15:00 2024 UTC: we have fully ingested the access logs, and the Pulsar cluster is working normally.