Incidents
Full history of incidents.
February 2021
On 2021-02-22 at 11:00 UTC, the API and deployment system will go down for a short maintenance window. Expected downtime is up to 10 minutes.
11:00 UTC: Maintenance is starting. Deployments are disabled.
11:02 UTC: API is down.
11:11 UTC: API and deployments are up again. Maintenance is over.
Some FS Bucket add-ons will need to be migrated to a different server for security reasons. During this migration, the buckets will be in read-only mode. Any attempt to create or update a file on the add-on will fail, including through FTP. Errors mentioning "Read-only file system" are expected during this migration (see the sketch after this entry).
The migration is expected to last at most 1 hour. All impacted applications will be redeployed during the migration. After the redeployment, applications will be able to write to their bucket again. Read operations will not be impacted.
Emails will be sent to customers of the impacted add-ons.
EDIT 12:00 UTC+1: The maintenance will begin shortly.
EDIT 12:04 UTC+1: The buckets are now read-only.
EDIT 12:13 UTC+1: The redeployment queue has started; it should not take more than 15 minutes.
EDIT 12:51 UTC+1: The maintenance is over; the queue finished 20 minutes ago and everything looks normal.
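For applications writing to an affected bucket, the expected failure during the read-only window is an OS-level "Read-only file system" error (errno EROFS). A minimal sketch of how a client could absorb it, assuming a Python application and an illustrative bucket mount path and retry policy:

    import errno
    import time

    def write_with_retry(path, data, attempts=5, delay=30):
        """Write to a file on an FS Bucket mount, retrying while the
        bucket is read-only (errno EROFS) during a migration window."""
        for attempt in range(attempts):
            try:
                with open(path, "w") as f:
                    f.write(data)
                return True
            except OSError as exc:
                if exc.errno != errno.EROFS:
                    raise  # unrelated error, surface it
                time.sleep(delay)  # bucket still read-only, wait and retry
        return False

Whether to retry, queue the write, or simply report the error depends on the application; the point is only that these errors are transient and end with the migration.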
We are investigating performance issues with our API.
10:44 UTC: We have found the cause and fixed the issue. It was due to an internal tool unexpectedly making too many costly requests.
We are investigating an issue with Metrics ingestion. Recent data is unavailable at this time.
EDIT 13:18 UTC: Ingestion is working again and running at full speed to catch up.
EDIT 14:03 UTC: Ingestion caught up a few minutes ago; everything should be back to normal.
Some FS Bucket add-ons will need to be migrated to a different server for security reasons. During this migration, the buckets will be in read-only mode. Any attempt to create or update a file on the add-on will fail, including through FTP. Errors mentioning "Read-only file system" are expected during this migration.
The migration is expected to last at most 1 hour. All impacted applications will be redeployed during the migration. After the redeployment, applications will be able to write to their bucket again. Read operations will not be impacted.
Emails will be sent to customers of the impacted add-ons.
EDIT 11:55 UTC+1: The maintenance will start on time.
EDIT 12:00 UTC+1: The maintenance is starting.
EDIT 12:07 UTC+1: Applications are being restarted. The restart queue should be done in about 20 minutes.
EDIT 12:41 UTC+1: The migration is over.
We are experiencing an issue with the storage of application logs. Ingestion is down and read access is partially unavailable.
13:06 UTC: The issue has been resolved; ingestion is catching up.
13:10 UTC: Ingestion is all caught up. This incident is over.
Our main API is unresponsive; as a result, the console and CLI are unusable as well. We are investigating.
13:53 UTC: The issue is fixed. Everything is back to normal.
An add-on reverse proxy restarted at 14:04 UTC, leading to connection loss on some add-ons for clients that were using that proxy. Impacted applications may have been able to reconnect to the add-on through a different reverse proxy; applications that remain unreachable will be redeployed.
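For clients hit by this kind of proxy restart, reconnecting is normally enough, since another reverse proxy can serve the same add-on. A minimal sketch, assuming a TCP add-on whose host and port are known to the application (values here are illustrative):

    import socket
    import time

    def connect_with_retry(host, port, attempts=5, delay=2):
        """Open a TCP connection to an add-on, retrying so that a proxy
        restart or a failover to another reverse proxy is absorbed."""
        for attempt in range(attempts):
            try:
                return socket.create_connection((host, port), timeout=5)
            except OSError:
                if attempt == attempts - 1:
                    raise  # out of attempts, let the caller handle it
                time.sleep(delay)  # short backoff before retrying

Database and message-broker client libraries usually provide the same behavior through their own reconnection settings, which is generally preferable to hand-rolled retries.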
January 2021
One of the IPs (149.56.147.232) of domain.mtl.clever-cloud.com is unreachable because OVH blocked it. We are working on restoring it.
EDIT 18:07 UTC: The IP has been restored. OVH blocked it after a 4-hour email notice about phishing content that had escaped our own filters. Further investigation will be conducted to avoid this kind of incident in the future.
Logs are currently delayed and may not be up-to-date. The queue is being consumed. Some messages may have been lost because of an unexpected service reboot. Log queries are still working.
EDIT 13:26 UTC: The queue has been consumed. Logs should now be up-to-date.
An FS Bucket server crashed and failed to restart automatically. An issue was preventing it from restarting properly; it is now fixed.
The server was unavailable for 8 minutes.
Our shared PostgreSQL leader is currently crashing repeatedly and entering recovery mode; we are investigating the cause.
Dedicated add-ons are NOT impacted.
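For users of the shared cluster who want to check whether their session has landed on a node that is still replaying WAL, PostgreSQL exposes pg_is_in_recovery(). A minimal sketch, assuming psycopg2 and a connection URI taken from an environment variable (the variable name is illustrative):

    import os
    import psycopg2

    # Connection URI of the shared PostgreSQL add-on (illustrative name).
    dsn = os.environ["POSTGRESQL_ADDON_URI"]

    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            # pg_is_in_recovery() returns true while the server is
            # replaying WAL after a crash or acting as a standby.
            cur.execute("SELECT pg_is_in_recovery();")
            in_recovery, = cur.fetchone()
            print("in recovery:", in_recovery)

A result of true simply means the node is not accepting writes yet; it does not by itself indicate data loss.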
We are experiencing issues with hypervisors. We are investigating.
EDIT 15:45 UTC: Two hypervisors went down. The impacted services are:
- Add-ons -> add-ons hosted on those servers are currently unavailable
- Applications -> applications that were hosted on those servers should be redeployed or be in the redeploy queue
- Logs -> new logs won't be processed. This includes drains. You might only get old logs when using the CLI / Console
- Shared RabbitMQ -> a node of the cluster is down, performance might be degraded
- SSH -> no new SSH connections can be made to applications for now
- FS Bucket -> an FS Bucket server was on one of the impacted servers. Those buckets are unreachable and may time out when writing / reading files
EDIT 15:54 UTC: Servers are currently rebooting.
EDIT 15:59 UTC: Servers rebooted and the services are currently starting. We are closely monitoring the situation.
EDIT 16:07 UTC: Services are still starting and we are double checking impacted databases.
EDIT 16:11 UTC: Deployments might take a few minutes to start due to the size of the deployment queue.
EDIT 16:33 UTC: Most services should be back online, including applications and add-ons. The deployment queue is still processing.
EDIT 16:45 UTC: The deployment queue has been empty for a few minutes; all deployments should now go through almost instantly.
EDIT 17:13 UTC: Deployment queue is back to normal.
EDIT 17:15 UTC: The incident is over.
We have detected an issue affecting our logs collection pipeline. New logs are not being ingested. We are investigating.
15:52 UTC: The issue has been identified and should be fixed. We are monitoring things closely.
16:11 UTC: Overall traffic in the logs ingestion pipeline is not completely back to normal. If one of your applications does not have up-to-date logs, you can try restarting it.
16:32 UTC: We forced a component of the ingestion pipeline to catch up with the logs waiting in the queue. Things should be back to normal in a matter of minutes.
We are investigating performance issues with the API and console. The issue seems to be caused by our dedicated reverse proxies (which do not affect the performance or availability of our customers' applications).
While we were investigating, something broke in one of those reverse proxies, which is causing availability issues. We are working on it.
10:25 UTC: The availability issue has been resolved. We are still working on resolving the performance issue.
10:32 UTC: We found the culprit and have implemented a work-around. Performance is back to normal. We are still working on an actual fix.
Our Pulsar cluster is currently having issues; we are investigating their impact on the cluster's usage and how to resolve them.
EDIT 14:03 UTC: The problem is now resolved. Some connection issues occurred, but a retry would have succeeded.
16:08 UTC: The API is rejecting several deployment requests.
16:10 UTC: Everything is back to normal.
One of the reverse proxies in the OVH zone was unreachable for 30 minutes due to an OVH networking issue. This is now fixed.
Redsmin is currently unavailable due to an expired TLS certificate. The Redsmin owners have been notified; we are waiting for them to update the certificate.
EDIT 22:30 UTC: The Redsmin owners updated the certificate. Redsmin should now be available again.
A 15-minute maintenance on OVH's side is planned at 06:00 UTC-5. Network connectivity might be lost during the maintenance. Only one server will be impacted. Applications will be redeployed. OVH status: http://travaux.ovh.net/?do=details&id=48360
EDIT 11:02 UTC: The server currently has no network. Add-ons hosted on it are currently impacted.
EDIT 11:16 UTC: The network is back. We are waiting for OVH to confirm the end of the incident.
EDIT 11:19 UTC: OVH has closed the incident; everything should be back to normal.