Clever Cloud Status

Incidents

Full history of incidents.

September 2024

Fixed · Infrastructure · Global

Wed Sep 18 22:22:29 2024 UTC: Several hypervisors have been rebooted in WSW. They came back 40 minutes ago, and we are fixing several services that are not back online.

EDIT Wed Sep 18 22:30:47 2024 UTC: We have been impacted by https://bare-metal-servers.status-ovhcloud.com/incidents/j7f4kpv9f17z. All services are now online.

Fixed · Global

On September 18, 2024, our network provider will carry out operations to improve network resiliency in the Paris region. No service interruption is expected during that upgrade. This is a follow-up to https://www.clevercloudstatus.com/incident/893.

Start date: 2024-09-18 19:00 UTC

End date: 2024-09-18 23:00 UTC

EDIT 2024-09-18 19:18 UTC: The maintenance is starting.

EDIT 2024-09-18 20:53 UTC: The maintenance is now over. No service interruptions were noted.

Fixed · Infrastructure · Global

At 06:41 UTC, we got an alert that the entire WSW region had stopped responding. At 06:44 UTC, we regained access to the hypervisors. The first check showed they had been rebooted. At 06:50 UTC, all customer services were up and running. At 07:15 UTC, we finished all checks confirming the region was healthy.

Here’s the matching OVHCloud status: https://bare-metal-servers.status-ovhcloud.com/incidents/hw285l60sq7h. It looks like an electrical incident happened on the racks that host our servers.

Fixed · Infrastructure · Global

Our monitoring system has reported high latencies when interacting with the SYD and SGP regions. We are investigating the issue.

EDIT 08:50 UTC: Latencies are back to normal; we are still monitoring the issue.

Fixed · Infrastructure · Global

We are experiencing network issues in the Paris region and are working to identify their cause.

EDIT 18:21 UTC: The situation seems back to normal. We are still working to identify the cause.

EDIT 18:23 UTC: we are working to restore impacted components.

EDIT 18:28 UTC: while preparing an intervention in one of our Paris data centers, we encountered an unintended network rerouting. Services are now fully operational again.

EDIT 20:40 UTC: Updated wording to include "Paris region" for impacted location.

Fixed · Deployments · Global

The build cache upload for deployments has had an elevated error rate since 19:05 UTC. The root cause has been identified. This may prevent your deployments from finishing correctly.

EDIT 22:25 UTC: The service is now fully operational again. Builds that failed because of this issue should be restarted. Please contact our support team if you need any assistance.

Fixed · Global

On September 11, 2024, our network provider will carry out operations to improve network resiliency in the Paris region. No service interruption is expected during that upgrade.

Start date: 2024-09-11 19:00 UTC

End date: 2024-09-11 23:00 UTC

EDIT 2024-09-11 19:36 UTC: The maintenance is starting.

EDIT 2024-09-11 23:00 UTC: The maintenance is now over. No additional impact besides those described in the following incident: https://www.clevercloudstatus.com/incident/895

Fixed · Infrastructure · Global

At 17:00 UTC, a hypervisor (hv-mtl2-012) stopped responding. The on-call team got an alert and started investigating. It appears that the hypervisor rebooted itself.

We are trying to find the cause and making sure that all services on that server restarted correctly.

UPDATE 17:24 UTC: the team just finished checking all the services; they are now up and running.

UPDATE: OVHCloud’s status confirms what we saw (a server rebooting for no reason). The problem impacts other servers (not ours) as well. Fortunately, we made sure not to place our OVH servers in the same racks. We’ll wait for the results of their investigation.

UPDATE 2024-09-05 08:55 UTC: The incident has been resolved on OVH’s side.

August 2024

Fixed · Global

Due to maintenance by our infrastructure provider, the MySQL and PostgreSQL DEV clusters of the Montreal (MTL) region will be unavailable on Tuesday, September 3, 2024 starting at 12:00 UTC.

The maintenance is expected to take around 1 hour. During that time, the MTL MySQL and PostgreSQL DEV add-ons will not be available.

This incident will be updated to reflect the maintenance status.

[30/08/2024 15:00 CET] Both clusters are available.

Fixed · Global

Due to hardware maintenance planned by our provider in the next few days, we will need to migrate the FSBucket service of the Montreal (MTL) region on Monday, September 2, 2024 starting at 08:00 UTC.

The maintenance is expected to take less than 1 hour. During that time, the FSBucket service will be read-only. Write operations will be denied. Read operations will continue to work as expected.

All applications linked to an FSBucket add-on on the Montreal region will be redeployed so they can reconnect to the server with read/write rights.
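For applications that write to their FSBucket during the window, here is a minimal, illustrative sketch of riding out the read-only period. It assumes denied writes surface as an OSError such as EROFS or EACCES (adjust to what your mount actually reports); the retry delay and attempt count are arbitrary:

```python
import errno
import time

# Illustrative only, not an official client: retry writes that are denied
# while the FSBucket mount is read-only. We assume the denial surfaces as
# an OSError with EROFS or EACCES; adjust to what your mount reports.
RETRYABLE = {errno.EROFS, errno.EACCES}

def write_with_retry(path, data, delay=30.0, attempts=120):
    for _ in range(attempts):
        try:
            with open(path, "wb") as f:
                f.write(data)
            return
        except OSError as e:
            if e.errno not in RETRYABLE:
                raise  # unrelated failure: do not swallow it
            time.sleep(delay)  # wait out the maintenance window, then retry
    raise TimeoutError("write still denied after %d attempts" % attempts)
```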

This incident will be updated to reflect the maintenance status.

EDIT 2024-09-02 08:08 UTC: The maintenance is starting. FSBuckets are now read-only.

EDIT 2024-09-02 08:24 UTC: Applications are redeployed and should now be able to access their FSBucket.

EDIT 2024-09-02 09:10 UTC: All applications have been redeployed since 08:40 UTC and the maintenance is over. We are still having an issue with the web interface; we are looking into it.

EDIT 2024-09-02 12:16 UTC: The web interface issue has been fixed.

Fixed · Global

Due to hardware maintenance planned by our provider in the next few days, we will migrate the Git repositories service of the Montreal (MTL) region on Friday, August 30, 2024 starting at 08:00 UTC.

The maintenance is expected to take less than 1 hour. During that time, the Git repositories service will be read-only. Git push operations will be denied. Pull operations will continue to work as expected.

This incident will be updated to reflect the maintenance status.

EDIT 2024-08-30 08:30 UTC: The maintenance is now over. The applications’ Git deployment URL has changed from push-n1-mtl-clevercloud-customers.services.clever-cloud.com to push-n2-mtl-clevercloud-customers.services.clever-cloud.com. The SSH identity remains the same. The old domain will keep working for backward compatibility.
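If you prefer to point an existing Git remote at the new host explicitly rather than rely on the backward-compatible domain, here is a small sketch. The remote name "clever" is an assumption; adapt it to however your repository names its deployment remote:

```python
import subprocess

# Illustrative helper: rewrite a Git remote from the old MTL push host to
# the new one. The old domain keeps working, so running this is optional.
OLD_HOST = "push-n1-mtl-clevercloud-customers.services.clever-cloud.com"
NEW_HOST = "push-n2-mtl-clevercloud-customers.services.clever-cloud.com"

def update_remote(remote="clever"):  # "clever" is an assumed remote name
    url = subprocess.check_output(
        ["git", "remote", "get-url", remote], text=True
    ).strip()
    if OLD_HOST in url:
        subprocess.check_call(
            ["git", "remote", "set-url", remote, url.replace(OLD_HOST, NEW_HOST)]
        )
        print("remote %r now points at %s" % (remote, NEW_HOST))

if __name__ == "__main__":
    update_remote()
```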

Fixed · Infrastructure · Global

A hypervisor is not responding. A VM seems to be stealing all the CPU.

We are force-rebooting this hypervisor.

21:09 status: the server refuses to reboot. We asked OVHCloud support for help.

A technician is having a look at that server. We are waiting for the result of their analysis.

21:33 status: the technician came back to us and reported a hardware issue. We are waiting for further updates and actions.

2024-08-27 07:15: OVHCloud support finished replacing the motherboard and gave us back the server. It fails to reboot outside of rescue mode. While part of the team works on getting the kernel to boot, the rest is moving all the data elsewhere to restore the impacted services for our customers.

09:50: All services are back up and running for our customers.

Fixed · Reverse Proxies · Global

(Times are in UTC)

  • At 14:24, two of the add-on reverse proxies of the PAR region stopped responding. After investigation, we found that both had failed to reconfigure correctly due to a "stuck" port: the port was still considered in use, and the listener failed to switch between the old process and the new one (see the sketch below).
  • At 14:34, we decided to fully reboot these two reverse proxies. This fixed the issue.

As a consequence, some applications routed through one of these two reverse proxies (out of a total of 7) lost their connection to their database for 10 minutes.
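For the curious, here is a minimal sketch (not our actual proxy code) of the general handover mechanism that can get stuck: when every process binds the listening port with SO_REUSEPORT, a new process can take over the port while the old one still holds it; without it, the second bind fails with "Address already in use", which is the class of stuck-port failure described above.

```python
import socket

# Minimal sketch (Linux), not Clever Cloud's actual proxy code: SO_REUSEPORT
# lets a new process bind a port the old process still holds, enabling a
# reconfiguration without downtime. Without it, the second bind() raises
# OSError: [Errno 98] Address already in use; that is the "stuck" port case.
def bind_listener(port):
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    s.bind(("127.0.0.1", port))
    s.listen()
    return s

if __name__ == "__main__":
    old = bind_listener(8080)  # stands in for the old proxy process
    new = bind_listener(8080)  # succeeds only because both set SO_REUSEPORT
    print("both listeners bound; the old one can now drain and close")
    old.close()
    new.close()
```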

Fixed · Infrastructure · Global

Our monitoring has detected errors on read queries to the telemetry cluster. We are investigating.

EDIT 21:30 UTC: We found that the issue is related to the indexes of the time-series database; we are investigating the cause of the error.

EDIT 21:40 UTC: Some indexes had errors and have been restarted; the estimated time to recover the indexes is around 01:00 UTC.

EDIT 01:00 UTC: Indexes are still restarting; the new estimated time is 03:00 UTC.

EDIT 02:47 UTC: Indexes are back online and queries are available.

EDIT 07:30 UTC: We are running some maintenance operations; queries may hang a bit.

EDIT 08:00 UTC: We have shut down queries to give the maintenance operations as much compute as possible. We have found the root cause and are fixing it, but to resolve the read errors we also need to perform some cleanup in parallel.

EDIT 09:40 UTC: We have turned queries back on; maintenance queries are still running in the background.

EDIT 13:00 UTC: We have turned queries off again: reads were competing with the maintenance queries. To shorten the recovery process, we decided to shut down read queries and keep the maximum compute capacity for the maintenance ones.

EDIT D+1 08:00 UTC: We have turned queries back on; the maintenance queries finished during the night.

Fixed · Infrastructure · Global

We are investigating the loss of a hypervisor in the MTL region.

EDIT 16:36 UTC+2: The machine seems to have a hardware problem. Our provider is investigating the issue.

EDIT 17:36 UTC+2: We've been informed that this server was affected by this maintenance: https://network.status-ovhcloud.com/incidents/ldl56trpj3kk. We are checking how much time they need to complete it.

EDIT 17:48 UTC+2: The hypervisor has been rebooted by OVH. We are currently checking its state and restarting services.

EDIT 18:03 UTC+2: The incident is now over.

Fixed · Global

We may be impacted by https://network.status-ovhcloud.com/incidents/nnhpfdw50vsn, which we are investigating. Only services based in OVH regions are affected.

Update 13:33 UTC: We are indeed impacted by OVHcloud’s backbone incident. Some network routes cannot reach OVHcloud’s datacenters. We are working on it. More info can be found at https://x.com/olesovhcom/status/1819742478586528146

Update 14:42 UTC: The network seems more reliable now. We are still monitoring the network links.

Update 15:16 UTC: Services are becoming operational again according to OVHcloud, and we are no longer seeing network issues.

Fixed · Customer support · Global

The on-duty phone number used by some customers is experiencing a problem at our telecommunications provider. The phone rings, but it is then impossible to talk on the line. Affected customers should email support@clever-cloud.com directly, and they will be called back.

Global outage
Fixed · Infrastructure · Global

We are experiencing a global outage. We observed a network split in addition to an event bus outage. The impact has been significant for some core services.

EDITS:

  • 2:00 PM CEST - Core services are being recovered and Deployments are being reloaded. This will resynchronize load balancers for customers' applications trying to reach their new deployments.
  • 2:08 PM CEST - Some services are being shut down to accelerate the recovery process. Expect a degraded experience for observability and deployments for a few minutes.
  • 2:29 PM CEST - Critical Core services are OK. Deployments are being rolled out.
  • 3:07 PM CEST - Some workload queues still have difficulties being processed. Some components may still be in an unstable state. The current effort is to identify them, then reload them.
  • 3:40 PM CEST - Some hypervisors have experienced crashes. The recovery process is ongoing and will take a couple of minutes.
  • 3:56 PM CEST - Some hypervisors still seem to be experiencing network issues.
  • 4:16 PM CEST - Apps are being deployed for premium customers. All apps are going to be deployed. Anyone can accelerate the process for their own applications by manually deploying them.
  • 4:24 PM CEST - In the meantime, we continue to identify noisy VMs that have been impacted by the outage.
  • 5:15 PM CEST - The Metrics API is being restarted.
  • 6:20 PM CEST - The last deployments are being rolled out. Reminder: accelerate by triggering a redeploy action.
  • 6:30 PM CEST - A few hundred VMs are still consuming very high CPU rates and are being cleaned up.
  • 6:35 PM CEST - We estimate approximately 40 minutes until all application deployments are fully recovered (MANUALLY REDEPLOY FOR FASTER RECOVERY).
  • 7:05 PM CEST - All IPsec links should be back online.

Fixed · Access Logs · Global

Following https://www.clevercloudstatus.com/incident/877, we are having difficulties processing access logs; you may observe gaps and lags.

Fixed · Deployments · Global

Following https://www.clevercloudstatus.com/incident/877, some deployments are failing. We are currently working on a solution.

EDIT 10:31 UTC: A workaround has been found to ensure that deployments work again.