Incidents
Full history of incidents.
September 2018
One of the nodes of the shared RabbitMQ cluster crashed. It's currently restarting.
EDIT 18:50 UTC: The node has successfully restarted; the cluster should now be operational as usual.
August 2018
The main API is unavailable, and the console cannot be loaded either.
We are looking into it.
EDIT 15:30 UTC: Our API is back online. The console can now be loaded.
A hypervisor is unreachable; we are working on fixing the issue.
Applications on this hypervisor are being automatically redeployed. Add-ons are unreachable.
EDIT 12:21 UTC: The hypervisor is back online and is restarting the add-ons.
EDIT 12:32 UTC: All add-ons are now reachable.
Deployments will be interrupted for 30 minutes at 12:30 UTC+2 today. A core component upgrade will be performed. This will not impact already running applications or add-ons. All deployments will be queued and executed at the end of the maintenance.
The maintenance shouldn't last longer than 30 minutes, but some delays may occur. We will update this ticket to let you know about the status of the maintenance.
EDIT 12:25 UTC+2: New deployments are no longer being consumed.
EDIT 12:30 UTC+2: The maintenance has started.
EDIT 12:56 UTC+2: Deployments have been back for ~10 minutes. We are still cleaning things up.
EDIT 13:03 UTC+2: Maintenance is over and was successful. Do not hesitate to contact us if anything's wrong on your side.
Deployments are temporarily disabled while we fix an issue with a component of the deployment system.
EDIT 19:17 UTC: This was actually a false positive from our monitoring. After verifying that the component is working fine and fixing the monitoring probe, we re-enabled deployments.
One FS Buckets server is unavailable, we are awaiting news from our provider.
EDIT 05:28 UTC: The server is partially and intermittently available. Our provider has identified the problem: it comes from the switch the server is connected to. They are working on fixing the issue.
EDIT 08:04 UTC: The issue has been fully fixed since 07:30 UTC.
The MongoDB cluster will not accept writes until the failure is fixed.
The failing node is up again.
Creation of add-ons and buckets on Cellar is temporarily failing. We are working on it.
EDIT 15:30 UTC: The creation of add-ons and buckets is now fixed. It may take a little longer than usual, but this slowness will be resolved in a few hours.
There is an issue with the entry point to the cluster.
Users are stretching the "fair usage" concept well beyond reasonable limits. We are working with them to enforce the fair usage.
Performance should now be restored.
We are still watching the cluster.
A network issue is preventing the logs system from working.
EDIT 13:17 UTC: Logs should be available again; the cluster is slowly recovering.
EDIT 13:23 UTC: The logs cluster is UP and running again; logs shouldn't have been lost, thanks to buffering.
Sorry for the inconvenience.
Maintenance of our Git repositories will take place on Thursday (2018-08-09) at 1 pm, UTC+2.
Write operations like "git push" or "clever deploy" to Clever Cloud repositories won't be possible for 30 minutes. Read access won't be affected during this time.
Thanks for your patience.
EDIT 13:00 UTC+2: The maintenance is starting.
EDIT 13:05 UTC+2: The maintenance is now complete. Do not hesitate to open a support ticket if anything goes wrong. Thanks for your patience!
July 2018
We are investigating connectivity issues on the File System Buckets.
EDIT 10:27 UTC: Connections should now be working again. Already established connections were also impacted and were slower than expected; this should now be fixed as well.
EDIT 10:27 UTC: The FS Buckets service is now fully operational.
We are currently experiencing issues on our deployment systems.
EDIT 13:25 UTC: Recovery is taking longer than expected; we are still working on it.
EDIT 13:59 UTC: We are still working on fixing these issues.
EDIT 14:08 UTC: We are still having issues but deployments can start.
EDIT 14:41 UTC: Deployment performance has been back to normal for more than 15 minutes now. We are still watching the situation closely. If you have an issue, please contact us.
June 2018
One of our hypervisors had a network issue for approximately 5 minutes.
Some of our internal services were impacted by this network issue, and thus automatic redeployment of applications has been delayed.
Everything is back to normal, applications are currently finishing their redeployment.
Due to ongoing maintenance from our provider, the logs system and a shared (and free) Redis cluster are unreachable. Logs may be lost. It should not last more than 15 minutes according to them. A few minutes might be needed to restart the logs cluster.
Redis should be back as soon as the maintenance ends.
EDIT 13:35 UTC: The maintenance is still ongoing.
EDIT 13:50 UTC: The maintenance is over. The Redis cluster is UP. The logs cluster is coming back UP. Logs should be saved but might not be immediately available through the console.
EDIT 14:30 UTC: The logs cluster is now fully operational too
One of our hypervisors has hard drive I/O failures. We are looking into it.
EDIT 11:08 UTC: The server was shut down a few minutes ago. Applications on it are being redeployed. Add-ons are currently unavailable.
EDIT 11:52 UTC: We are still waiting for news from our provider regarding the hard drive issue.
EDIT 21:20 UTC: Our provider is still working on finding the root cause of the issue.
EDIT 2018-06-29 07:05 UTC: We received an answer from our provider: the server can't be brought back online. Databases will need migration. We are waiting for an answer to know whether we can access the disks in read-only mode to transfer the databases. If not, backups from the 28th of June will be used.
EDIT 2018-06-29 07:18 UTC: The disks can't be read. Backups will need to be used.
The logs collector needs to be restarted. Some logs might be lost for one to two minutes.
EDIT 22:00 UTC: The restart took approximately 30 seconds; most applications re-sent the logs they couldn't send during that time.
A hypervisor is down/unreachable. There seems to be a hardware problem. We are investigating it.
Some databases are unreachable.
EDIT 2018-06-18 23:25 UTC: It seems to be a malfunctioning fan. The server is still down for investigation. We are waiting for more information from our hypervisor provider.
EDIT 2018-06-19 00:37 UTC: The malfunctioning fans have been replaced. The server is up again. All the databases are up and running.