Clever Cloud Status

Incidents

Full history of incidents.

December 2021

Fixed · Infrastructure · Global

A hypervisor is unreachable in the RBX zone. Affected applications are being redeployed automatically. Affected add-ons are unreachable.

13:40 UTC: Multiple servers in the same rack have gone down at the same time. It's most likely a network issue.

13:45 UTC: Our provider (OVHcloud) is aware of the issue. They will come back to us with more details later.

13:53 UTC: The hypervisor is back online. We are making sure everything is fine.

14:11 UTC: Everything is fine now. There was an issue with outgoing traffic from 13:53 until 14:08 UTC; this is now fixed.

Our provider tells us it was an issue with the cooling system. More info may be posted here: https://bare-metal-servers.status-ovhcloud.com/incidents/5cqtb0q9ht67

Fixed · Services Logs · Global

We are experiencing an issue with the logs ingestion pipeline. We are looking into it.

EDIT 09:15 UTC: The ingestion pipeline is back to normal. No abnormal delay.

Fixed · Access Logs · Global

Metrics and Access logs queries are currently unavailable. Data is still being ingested; only queries are impacted. The ETA for resolution is 15:00 UTC.

This impacts:

  • Metrics (Grafana, in the console or using our API)
  • Access logs (request tiles for an organization or application, CLI access, or our API)

EDIT 14:52 UTC: The queries are available again since 14:20 UTC. This incident is over.

Fixed · Global

Elastic released a security bulletin regarding the impact of CVE-2021-44228 on Elasticsearch. Elastic recommends that users apply the -Dlog4j2.formatMsgNoLookups=true JVM option and restart Elasticsearch. More information is available in Elastic's security bulletin: https://discuss.elastic.co/t/apache-log4j2-remote-code-execution-rce-vulnerability-cve-2021-44228-esa-2021-31/291476

We will apply this option on all add-ons and restart them as an emergency maintenance. For single-node add-ons, this will trigger a short downtime of at least one minute (the approximate time it takes Elasticsearch to boot). For clustered add-ons, no downtime is expected, as the restart will be rolling.
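
For reference, on a self-managed node the same mitigation is typically applied by adding the flag to Elasticsearch's JVM options file and restarting the node. A minimal sketch, assuming a standard installation (the exact file path depends on your setup; config/jvm.options is a common default):

    # config/jvm.options -- disable log4j2 message lookups (CVE-2021-44228 mitigation)
    -Dlog4j2.formatMsgNoLookups=true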

Newly created add-ons are already patched.

The restart of all add-ons will begin at 15:00 UTC. Sorry for the short notice. Feel free to contact our support if you have any questions.

EDIT 15:05 UTC: The add-ons restart is starting.

EDIT 16:10 UTC: Add-ons have been restarted. The maintenance is over.

Fixed · Reverse Proxies · Global

An add-on reverse proxy crashed at 10:21 UTC and was restarted at 10:24 UTC. During that time, some services connecting to their add-ons might have experienced unexpected connection errors (connection lost, connection refused, ...).

The issue is now fixed.

Fixed · Services Logs · Global

We are experiencing an issue with the logs ingestion pipeline. We are looking into it. The issue started at 23:30 UTC yesterday and was not caught until 08:02 UTC because a monitoring alert had been missing since a maintenance operation a few days ago.

09:22 UTC: The issue is identified and fixed, logs ingestion should catch up. Logs should appear within a few minutes.

09:38 UTC: The issue is not actually fixed, there is something else blocking the pipeline. We are investigating.

09:55 UTC: The ingestion is working again. There are a lot of older logs to process, so it will take a while before you can see recent logs in real time.

13:07 UTC: The ingestion pipeline is back to normal. No abnormal delay.

Fixed · Infrastructure · Global

One of our hypervisors in the Singapore zone is down.

November 2021

Fixed · Infrastructure · Global

We are investigating a network issue. We are seeing random TCP timeouts and ICMP packets dropped for a few remote hosts from some PAR hosts (very few hosts are affected by this). This started occurring on 2021-11-25 at around 22:15 UTC.

10:53 UTC: We are still investigating this issue. The culprit seems to be a peering node.

11:18 UTC: It seems to only affect a few routing paths between our infrastructure and some hosts of Scaleway and Azure. We are trying to narrow down the issue with their network teams.

13:05 UTC: We have seen improvements between Scaleway and our infrastructure since 11:26 UTC. We do not yet know if this is a temporary resolution and are awaiting more information from Scaleway.

13:36 UTC: Confirming that the issue between Scaleway and our infrastructure has been fixed. We are still awaiting details from Scaleway to confirm whether they changed their routing configuration to avoid the faulty peer.

15:10 UTC: Scaleway tells us they did not change anything on their end. Still, there has been no issue to report on this side since 11:26 UTC. On the Azure side, things seem better: the issues we could reproduce earlier can no longer be reproduced, but some hosts may still be affected. We are marking this as resolved, but if you have any specific problems, please contact us so we can troubleshoot more efficiently.
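
If you suspect you are still affected by this kind of intermittent path issue, one way to confirm it is to probe TCP connect times in a loop from an affected host. A minimal Python sketch; the target host and port are placeholders, not specific to this incident:

    # Probe TCP connect times repeatedly to spot intermittent timeouts on a path.
    import socket
    import time

    def probe(host: str, port: int = 443, attempts: int = 20, timeout: float = 3.0) -> None:
        for i in range(attempts):
            start = time.monotonic()
            try:
                with socket.create_connection((host, port), timeout=timeout):
                    print(f"{i}: connected in {time.monotonic() - start:.3f}s")
            except OSError as exc:
                print(f"{i}: failed after {time.monotonic() - start:.3f}s ({exc})")
            time.sleep(1)

    probe("example.com")  # placeholder target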

Fixed · Global

One of our hypervisors needs to be shut down because of a faulty memory module. Applications have already been redeployed elsewhere and add-ons will be automatically migrated starting on December 1, 2021 at 20:30 UTC+1. Add-ons that can't be migrated will experience up to 1 hour of downtime.

Impacted users will shortly receive an email and can contact us on our technical support for any further questions.

EDIT 20:32 UTC+1: Add-on migrations are starting.

EDIT 21:31 UTC+1: Add-ons have been migrated. Add-ons that couldn't be migrated in the first place will be unavailable for up to one hour. We will announce the planned downtime tomorrow (02/12/2021).

EDIT 02/12/2021: The hypervisor will be rebooted on December 06, 2021 at 11:00 UTC+1. The expected downtime is less than 1 hour.

EDIT 06/12/2021 10:59 UTC+1: The hypervisor is going down at 11:00 UTC+1 as expected. Downtime should not exceed one hour.

EDIT 06/12/2021 11:09 UTC+1: The hypervisor has been back up for 3 minutes; all services should be reachable again. We are making sure everything runs fine.

EDIT 06/12/2021 11:13 UTC+1: The maintenance is over.

Fixed · Services Logs · Global

We are experiencing an issue with the logs ingestion pipeline. We are looking into it.

12:21 UTC: The incident is resolved (there may be some lag for a few minutes).

Fixed · Global

Applications using PHP 7.0 to 7.2 will be upgraded to PHP 7.4 automatically on December 1st, 2021.

PHP versions 7.0 through 7.2 have reached end of life and no longer receive security updates, leaving them exposed to known vulnerabilities. You can find the list of end-of-life versions here: https://www.php.net/eol.php.

Affected customers will be e-mailed about this change and can contact our support team for any additional questions.
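
If you would rather move ahead of the automatic upgrade, the PHP version of an application can be pinned explicitly. A sketch, assuming the CC_PHP_VERSION environment variable and the clever-tools CLI (check the current documentation for your setup):

    clever env set CC_PHP_VERSION "7.4"

followed by a restart of the application.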

Fixed · Access Logs · Global

Access Logs and Billing are experiencing issues.

EDIT 13:05 UTC: Fixed.

October 2021

Fixed · Access Logs · Global

The Metrics / Access logs platform is currently having issues. We are investigating.

EDIT 11:00 UTC: A node from the cluster failed to reboot and was stuck in a failed state. We are rebuilding this node; it will take 2 to 3 hours. No data will be lost.

Fixed · Services Logs · Global

(Times in UTC)

09:15 - The RabbitMQ cluster handling live logs started to fail on the "logs" vhost. We started recreating the vhost.

2021-10-24 08:00 - We noticed that parts of the logs system were still not working and investigated. The Logs API kept crashing for no apparent reason.

11:45 - The Logs API stopped crashing. We don't know why yet and are continuing to investigate so we can fix this for the long term.

Fixed · API · Global

Webhook and e-mail notifications have not been sent since 22:30 UTC on 2021-10-21. The notification service lost its connection to the message queue service and failed to reconnect automatically. This was due to a short network outage between our two Paris datacenters. The issue was mixed in with others and went unnoticed.

At 11:12 UTC today, the queue was emptied, so webhooks matching events from this period have not been and will not be sent out. Notifications for events from 11:12 to 12:25 UTC were all sent at once, and everything has been back to normal since then.
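
The general lesson is that a consumer should treat a lost broker connection as recoverable rather than fatal. A minimal sketch of that pattern, using the pika RabbitMQ client as an example (the URL and queue name are placeholders, not our actual setup):

    # Reconnect to the broker after an outage instead of dying silently.
    import time

    import pika
    from pika.exceptions import AMQPConnectionError

    def handle(channel, method, properties, body):
        # Process one notification event, then acknowledge it.
        print(body)
        channel.basic_ack(delivery_tag=method.delivery_tag)

    def consume_forever(url: str, queue: str) -> None:
        while True:
            try:
                connection = pika.BlockingConnection(pika.URLParameters(url))
                channel = connection.channel()
                channel.basic_consume(queue=queue, on_message_callback=handle)
                channel.start_consuming()
            except AMQPConnectionError:
                time.sleep(5)  # back off, then establish a fresh connection

    consume_forever("amqp://guest:guest@localhost/", "notifications")  # placeholders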

Fixed · Infrastructure · Global

An add-on reverse proxy was unreachable in the PAR zone. Some applications may have had issues connecting to their add-ons or may have unexpectedly lost existing connections.

The reverse proxy has been rebooted and this incident is now over.

Fixed · Global

One of our hypervisors needs to be shut down because of a faulty memory module. Applications have already been redeployed elsewhere and add-ons will be automatically migrated starting on October 19, 2021 at 20:30 UTC+2.

Impacted users will shortly receive an email and can contact us for any further questions.

EDIT 19/10 18:35 UTC: The migration of add-ons has started.

Fixed · Global

The certificate associated with the *.cleverapps.io domains has expired. We are renewing it as soon as possible.

10:40 UTC+2: The issue is resolved.
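
Certificate expiry is easy to monitor ahead of time. A minimal Python sketch that reads the notAfter date of the certificate a host serves (the hostname is a placeholder; note that the handshake itself fails once a certificate has actually expired, which is itself a usable alert):

    # Read the expiry date of the certificate presented by a host.
    import datetime
    import socket
    import ssl

    def cert_expiry(host: str, port: int = 443) -> datetime.datetime:
        context = ssl.create_default_context()
        with socket.create_connection((host, port)) as sock:
            with context.wrap_socket(sock, server_hostname=host) as tls:
                cert = tls.getpeercert()
        # notAfter looks like "Jun  1 12:00:00 2022 GMT"
        return datetime.datetime.strptime(cert["notAfter"], "%b %d %H:%M:%S %Y %Z")

    print(cert_expiry("example.cleverapps.io"))  # placeholder hostname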

Fixed · Access Logs · Global

The Metrics / Access logs platform is currently having issues; queries are returning errors. We are investigating.

EDIT 15:07 UTC: The problem has been identified and fixed. Queries should now be back; the current data lag is 1 hour and 30 minutes. It should come down quickly over the next hour.

EDIT 17:58 UTC: The ingestion lag is now resolved.

Fixed · Global

We are experiencing networking issues with our OVH-based infrastructure; we are waiting for more information from OVH.

https://twitter.com/ovh_status/status/1448185498812485633?s=20

The website travaux.ovh.com is unreachable, preventing us from getting a status update on the maintenance, for which "no impact" was expected.

09:55 UTC+2: We still have no update from OVH.

10:01 UTC+2: https://twitter.com/olesovhcom/status/1448196879020433409?s=20

10:20 UTC+2: Our Montreal zone is reachable; other zones might come back soon.

All our zones are now reachable. You might still experience DNS issues or other problems due to the OVH incident itself.