Clever Cloud Status

Incidents

Full history of incidents.

Oldest first

June 2023

Fixed · Infrastructure · Global

The monitoring system has detected that a hypervisor is unreachable. We are investigating.

EDIT 08:32 UTC: We have found the issue and the hypervisor is rebooting.

EDIT 08:50 UTC: The hypervisor has finished rebooting and services are working.

Fixed · Infrastructure · Global

A hypervisor rebooted on the Paris zone. Impacted applications are being redeployed on other servers. We are monitoring the situation.

EDIT 11:40 UTC: All impacted applications have been redeployed automatically. We will investigate further why this server rebooted. The incident is now over.

Fixed · Access Logs · Global

Our metrics system's HBase cluster is in an inconsistent state. We have identified the nodes responsible and are fixing them.

12:26 UTC: We restarted the node responsible for the issue. While it re-converges, we have stopped the egress servers. We will bring them back online in a few minutes.

13:31 UTC: The query service is back online. We are still catching up on the lag, so new data points may not be available yet.

14:35 UTC: The lag has been caught up.

Fixed · Access Logs · Global

Our monitoring has detected a failure on the storage layer for metrics and access logs. We found that a storage node had lost several disks. We have removed the faulty disks and restarted the storage node.

EDIT 16:00 UTC: The storage layer has been restarted and we are consuming the ingestion lag.

Fixed · Infrastructure · Global
  • 2023-06-07 08:56 UTC: A hypervisor on the RBX zone has rebooted.
  • 09:00: The machine has fully rebooted and is restarting all its VMs. Application VMs are being redeployed on other hypervisors.
  • 09:31: The checks are done; everything seems to be running fine as of now.

We will investigate to understand why this hypervisor rebooted in the first place.

Fixed · Reverse Proxies · Global

Monitoring of load balancers is detecting an abnormal number of HTTP 404 responses. We are investigating.

EDIT 13:00 UTC: We have located the root cause and are applying a fix.

EDIT 14:20 UTC: The issue is resolved.

Fixed · Infrastructure · Global

We lost connectivity with a hypervisor on RBX. Applications have been redeployed, but some databases may not be reachable. We are investigating.

EDIT 03:58 UTC: The server is back online. All databases should now be reachable.

Fixed · Access Logs · Global

We are detecting errors on the storage layer responsible for metrics and access logs data. We are investigating.

EDIT: The lag has been caught up.

Fixed · Infrastructure · Global

Monitoring of load balancers is detecting an abnormal number of HTTP 404 responses. We are investigating.

EDIT 17:51 UTC: We have found the issue and the fix has been applied. Everything is operating normally.

Fixed · Cellar · Global

2023-06-01 16:20 UTC: During the RBXHDS incident, one of the Cellar load balancers (LBs) lost its configuration. The configuration of each LB was not individually monitored; only the availability of the service as a whole was.

2023-06-02 09:15 UTC: After customer complaints, we found out about the LB misconfiguration and fixed it.

2023-06-02 09:28 UTC: Monitoring checks have been added to catch this kind of issue right away (a sketch of such a per-node check is shown below).
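The lesson from this incident is to monitor each load balancer's configuration individually rather than only the availability of the service behind them. Below is a minimal sketch of such a per-node check; the hostnames, the /health/config endpoint, and the expected configuration hash are hypothetical and not taken from the actual Cellar setup.

  # Hypothetical per-node configuration check: each load balancer is probed
  # individually instead of only checking the service behind the shared entry point.
  # Hostnames, endpoint, and expected hash are placeholders.
  import json
  import urllib.request

  LB_NODES = ["cellar-lb-1.example.internal", "cellar-lb-2.example.internal"]
  EXPECTED_CONFIG_HASH = "3f6c1a"  # hash of the configuration we expect to be deployed

  def node_config_ok(host: str) -> bool:
      """Return True if this node reports the expected configuration hash."""
      with urllib.request.urlopen(f"http://{host}/health/config", timeout=5) as resp:
          payload = json.load(resp)
      return payload.get("config_hash") == EXPECTED_CONFIG_HASH

  for node in LB_NODES:
      try:
          if not node_config_ok(node):
              print(f"ALERT: {node} is serving an unexpected configuration")
      except OSError as exc:  # an unreachable node is an alert as well
          print(f"ALERT: {node} unreachable: {exc}")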

Fixed · Customer support · Global

We are currently aware of an issue impacting our Ticket center service. It may prevent our customers from opening, viewing, and replying to tickets opened with our support team.

EDIT 13:30 UTC: Our ticket center provider told us that the issue has been mitigated on their end and that it is now resolved. We keep monitoring the situation for now, but we can indeed see that the service has been operating normally for the last few minutes.

EDIT 14:47 UTC: We did not see any other issues. We consider this incident to be over.

May 2023

Fixed · Access Logs · Global

The monitoring has detected errors on the metrics / access logs storage layer. We are investigating.

EDIT 11:46 UTC: We have found the issue and fixed it. We are catching up on the lag.

EDIT 13:19 UTC: The lag has been consumed; everything is operating normally.

Fixed · Infrastructure · Global

A hypervisor on the Montreal zone is unreachable. One of the FSBucket servers of the zone is hosted on it and is therefore unreachable too. This might impact PHP applications as well as any application using an FSBucket hosted on this server.

We are awaiting information from our infrastructure provider regarding this incident.

EDIT 19:53 UTC: It seems like multiple servers are impacted at the same time, we believe it to be an issue with a specific OVH rack or room. Multiple services on the zone are thus impacted. We are looking at ways to mitigate the issues.

EDIT 20:05 UTC: The servers have been reachable again for a few minutes. We are currently making sure everything is fine. The OVH incident can be followed here: https://bare-metal-servers.status-ovhcloud.com/incidents/k664s90jxfj0

EDIT 20:15 UTC: Servers in the impacted rack could not reach each other up until now, which could have prevented some services from working correctly. It seems OVH fixed it before we could report it to them. We are continuing to make sure everything is working as expected.

EDIT 20:36 UTC: The incident is over. We are redeploying all the applications of the zone to be on the safe side.

Fixed · Access Logs · Global

We are currently having an ingestion issue on our metrics cluster. The root cause has been identified and we are currently working on a fix. Until this incident is fixed, metrics data points might be missing from your metrics dashboards. Access logs are also impacted but will be re-queued later.

EDIT 14:14 UTC: Metrics ingestion is now back to normal. Access logs are being re-queued and are currently lagging a bit.

EDIT 14:20 UTC: Access logs have been ingested and are now up-to-date. The incident is now over.

EDIT 16:25 UTC: The problem came back; we are working on it.

EDIT 16:56 UTC: The problem is now solved again. Another root cause has been identified and has been fixed.

Fixed · Cellar · Global

We are encountering slowness on the Cellar infrastructure. We are investigating why.

EDIT 15:05 UTC: The issue has been found and fixed. Performance went back to normal around 13:45 UTC. Additional measures will be taken to avoid this issue in the future.

Fixed · Reverse Proxies · Global

Users reported issues while connecting to their database. We are investigating.

09:30 UTC: A huge number of add-ons recently created by malicious users was detected. They were issuing a lot of configuration changes on our reverse proxies, making them unstable.

We banned those users and are watching the situation closely.

Fixed · MongoDB shared cluster · Global

The MongoDB shared cluster for free add-ons seems to be under heavy load. We are investigating.

Fixed · Deployments · Global

Deployment services are experiencing an abnormal load. We have identified the root cause and are fixing it.

12:20 UTC: Deployments are still running slowly. We are still cleaning up the situation.

13:16 UTC: We have found a deployment loop involving the monitoring. We are stopping it…

13:51 UTC: Cleaning is done; we are watching to see if deployments are running as expected.

14:00 UTC: We have found an abnormal behaviour; we are investigating.

D+1 14:30 UTC: We have made a patch for the abnormal behaviour and are watching deployments.

April 2023

Fixed · Global

Maintenance is now over

--

On Thursday, April 27 at 2:00 PM CEST (12:00 UTC), we will apply a major update to the Clever Cloud APIs.
This update prepares work for future and current services.

Are you affected?

All Clever Cloud public regions are affected. Gov, Private, and On Premise regions are not affected.

What's the expected behavior during the maintenance window?

All Applications and Cloud Services will continue to run as expected.
Some API calls may be delayed or refused for a few minutes. Deployments may take a bit longer than expected.

We expect services to be fully operational by 3 PM CEST.

What do I have to do?

If you manage your own scaling, please make sure your capacity requirements are fulfilled by 1 PM CEST, since autoscaling won't be as reactive during the maintenance window (see the sketch after this notice).
--
We will keep you posted on the process here and via this Twitter thread.
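For customers who manage scaling themselves, one possible way to freeze instance counts ahead of the window is through the clever-tools CLI. The sketch below is only an illustration: the application alias and instance count are placeholders, and the "clever scale" flag names should be checked against your installed clever-tools version.

  # Sketch: pin an application to a fixed number of instances before the
  # maintenance window, since autoscaling will be less reactive during it.
  # The alias and the instance count are placeholders.
  import subprocess

  def pin_instances(alias: str, count: int) -> None:
      """Set min and max instances to the same value via the clever-tools CLI."""
      subprocess.run(
          ["clever", "scale", "--alias", alias,
           "--min-instances", str(count),
           "--max-instances", str(count)],
          check=True,
      )

  # Before 1 PM CEST: hold 4 instances; revert to the usual range after the window.
  pin_instances("my-app", 4)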

Fixed · Infrastructure · Global

The monitoring has detected a burst of connections on the PAR-SCW (Paris Scaleway) and RBX (Roubaix) regions. Applications may have experienced disconnections and blocked new connections.

EDIT 20:18 UTC: The issue is mitigated and we are monitoring.

EDIT 20:50 UTC: Everything is back to normal levels.