Incidents
Full history of incidents.
November 2018
We experienced issues with deployments in the Montréal (MTL) zone from 15:24 to 15:30 UTC.
A validation test of a webhook API update made its way to production. For 8 minutes, clients received events not meant for them. This is now fixed.
Sorry for the inconvenience.
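For illustration, the class of bug here is an event-routing guard: a webhook event must only ever be delivered to endpoints registered by the client that owns it. A minimal sketch of such a guard, with hypothetical names (Event, endpoints_for, and subscriptions are illustrative, not our actual webhook code):

```python
# Minimal sketch with hypothetical names (not our actual webhook code):
# an event is delivered only to endpoints registered by its owner, so a
# routing bug cannot broadcast it to other clients.
from dataclasses import dataclass

@dataclass
class Event:
    owner_id: str
    payload: dict

def endpoints_for(event: Event, subscriptions: dict) -> list:
    """Return only the webhook URLs registered by the event's owner."""
    # An unknown or empty owner must never fall through to all clients.
    if not event.owner_id:
        return []
    return subscriptions.get(event.owner_id, [])
```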
We are deploying a new feature for the PostgreSQL add-on; creation and management of these add-ons are currently disabled.
A core component keeps restarting, making metrics unavailable for fetching. No metrics are lost during these restarts. We will take action to fix this issue in the coming days.
A mitigation was applied at 02:30 UTC (2018-11-21) which successfully fixed the issue, though only temporarily.
A permanent fix will be applied later today, which will require downtime for that component.
EDIT 2018-11-21 16:50 UTC: The permanent fix is postponed to tomorrow, 2018-11-22.
EDIT 2018-11-22 10:40 UTC: The fix will be applied at 10:50 UTC. It will require at least one restart of that component, making Metrics unavailable for about 20 minutes.
EDIT 2018-11-22 11:25 UTC: Metrics have been back since 11:08 UTC. Incident over.
What appears to be a network issue is affecting several hypervisors and services. We are investigating.
EDIT 19:21 UTC: Here is our provider's incident report: https://status.online.net/incident/153 (3 racks lost public connectivity)
EDIT 20:33 UTC: The issue should be fixed. As of now, our monitoring is happy. We are cleaning up.
Some databases have been reported as slower than usual because of network slowness. We investigated and took action on our reverse proxies; one of them was fully restarted, leading to a loss of established connections. We are currently monitoring whether these actions improve the situation.
EDIT 12:10 UTC: The issue appears to be resolved now.
Metrics currently can't be accessed. Metrics ingestion still works; only metrics fetching is affected.
EDIT 16:25 UTC: One of the components was failing due to a network configuration error. The network configuration has been fixed and the component is currently restarting; it should be back up in about 15 minutes.
EDIT 16:40 UTC: The component has restarted and metrics are available again for reads. No data was lost. Sorry for the extended interruption.
A network issue is currently affecting our reverse proxies in the Montréal zone. We are working on it.
EDIT 13:28 UTC: The network issue has been resolved as of 13:20 UTC. Everything should be back to normal. Sorry for the trouble.
Our API was unavailable for 10 minutes. The CleverCloud console couldn't load and deployments wouldn't start. This has been fixed.
October 2018
A hypervisor is unreachable.
Affected applications are being restarted automatically.
Affected addons are unreachable.
EDIT 17:56 UTC: Looks like it's a network issue, we are awaiting word from our provider.
EDIT 18:08 UTC: Our provider tells us they are working on it; no ETA or details given.
EDIT 18:26 UTC: There was a short electrical outage in the datacenter where this server is located; some routers and switches were impacted by the switchover to the backup power source. They are working on fixing the affected network hardware.
EDIT 18:44 UTC: The server is back; addons should be reachable. We are making sure that everything is back online.
EDIT 18:56 UTC: Everything is working fine. Incident closed.
Some nodes of the cluster crashed and are currently being restarted. Users of this cluster may experience disconnections and failures to read or publish messages.
Update 16:34 UTC: The cluster nodes have been restarted. The cluster is UP again. Sorry for the inconvenience.
A human error caused an issue with the configuration of the add-on reverse proxies at 12:18 UTC. From that point, dedicated MySQL add-ons were unavailable, except over already-established connections.
At 12:30 UTC, we found the cause of the issue.
At 12:32 UTC, the issue was fixed and we regenerated the reverse proxies configuration.
At 12:33 UTC, add-ons were available again.
We have put the necessary protections in place to prevent this from happening in the future.
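One common protection for this class of incident is to validate a regenerated proxy configuration before it can replace the live one. A sketch, assuming an HAProxy-style proxy (the command and paths are illustrative, not necessarily the tooling we use):

```python
# Sketch: regenerate the proxy configuration, validate it, and only then
# swap it in. `haproxy -c -f FILE` exits non-zero on an invalid config.
import shutil
import subprocess

def apply_config(candidate_path: str, live_path: str) -> None:
    check = subprocess.run(["haproxy", "-c", "-f", candidate_path])
    if check.returncode != 0:
        raise RuntimeError("refusing to deploy an invalid configuration")
    shutil.copy(candidate_path, live_path)
    subprocess.run(["systemctl", "reload", "haproxy"], check=True)
```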
Deployment actions (start, restart, stop, git push, ...) are slower than usual. We are looking into this.
13:09 UTC: We are going to restart one of the core deployment systems. Deployment actions (like the ones listed above) will be unavailable for up to 30 minutes. All actions will be queued and executed at the end of the maintenance (sketched below).
13:40 UTC: Another problem occurred during the restart of that system. We are now working to fix it.
EDIT 14:03 UTC: Deployments have been available again for about 5 minutes. We are still cleaning things up before closing this incident.
EDIT 14:30 UTC: Everything should be back to normal now. Sorry for the extra maintenance time and the deployment unavailability.
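The queue-then-drain behaviour mentioned in the 13:09 UTC update can be pictured like this (a hypothetical sketch, not our deployment system):

```python
# Sketch: while maintenance is on, submitted actions are queued instead of
# executed; when maintenance ends, the backlog is drained in FIFO order.
from collections import deque

class ActionGate:
    def __init__(self) -> None:
        self.maintenance = False
        self.backlog = deque()

    def submit(self, action) -> None:
        if self.maintenance:
            self.backlog.append(action)  # executed at end of maintenance
        else:
            action()

    def end_maintenance(self) -> None:
        self.maintenance = False
        while self.backlog:
            self.backlog.popleft()()
```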
Our monitoring system suffered a network cut that made it see many applications as unreachable. Those applications are being redeployed automatically, which may delay new deployment actions (start / redeploy / stop) because of the number of ongoing deployments.
EDIT 12:50 UTC: The deployments should now be back to normal. Apologies for the delays.
September 2018
The GitHub API has changed, so we are patching our API to fix auto-deployments of new applications; the fix will be retroactive.
The follower stopped replicating data and taking backups.
We are trying to restart it.
The PostgreSQL leader is down. We are promoting the follower and updating domains.
DNS has been updated. Clients should now reconnect to the database.
EDIT 12:22 UTC: The new leader has been serving requests correctly since 00:30 UTC.
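For context, the recovery path above is a standard PostgreSQL failover: promote the standby, then repoint the database hostname so clients reconnect to the new leader. A minimal sketch (`pg_ctl promote` is real PostgreSQL tooling; `update_dns` is a hypothetical stand-in for a provider-specific DNS API):

```python
# Sketch of a leader failover. `pg_ctl promote` is standard PostgreSQL
# tooling; update_dns() stands in for a provider-specific DNS API.
import subprocess

def update_dns(hostname: str, ip: str) -> None:
    raise NotImplementedError("provider-specific DNS update")

def failover(data_dir: str, db_host: str, follower_ip: str) -> None:
    # Promote the standby: it stops replaying WAL and starts accepting writes.
    subprocess.run(["pg_ctl", "promote", "-D", data_dir], check=True)
    # Repoint the hostname; clients reconnect once the record propagates
    # (kept quick by a low TTL on the database record).
    update_dns(db_host, follower_ip)
```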
The shared cluster is experiencing performance issues. We are working to mitigate them.
Deployments using a cache (build cache, dependencies cache) are failing because the cache can't be downloaded. We are investigating.
EDIT 10:17 UTC: We are still working on the issue. If you have trouble deploying, you can set your application's scalability settings to the size a dedicated build instance would use. Do not hesitate to ping our support if needed.
EDIT 10:25 UTC: ETA is 2 hours if everything goes well.
EDIT 12:30 UTC: Deployments with cache are back. Everything should work as expected from now on. Sorry for any failed deployments or longer-than-expected deployment times.
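A build pipeline that treats the cache as an optimization rather than a hard dependency degrades more gracefully in this situation. A hypothetical sketch (these helpers are illustrative, not our actual build system):

```python
# Hypothetical sketch: fall back to a clean (slower) build when the cache
# backend is unreachable, instead of failing the deployment outright.
from typing import Optional

def download_cache(app: str) -> Optional[bytes]:
    raise IOError("cache backend unreachable")  # simulates this incident

def build(app: str) -> None:
    try:
        cache = download_cache(app)
    except IOError:
        cache = None  # clean build: slower, but the deployment still runs
    print(f"building {app} with {'warm' if cache else 'cold'} cache")

build("example-app")
```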
One of the two reverse proxies for *.cleverapps.io crashed and took longer than usual to restart. Traffic hitting this server didn't complete: requests hung until the connection timed out.
The problem was resolved at 16:08 UTC.