Incidents
Full history of incidents.
November 2018
We experienced issues with deployments in the Montréal (MTL) zone from 15:24 to 15:30 UTC.
A validation test of a webhook API update made its way to production. For 8 minutes, clients received events not meant for them. This is now fixed.
Sorry for the inconvenience.
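For illustration, the class of bug here is an event-routing guard: a webhook event must only ever be delivered to endpoints registered by the client that owns it. A minimal sketch of such a guard, with hypothetical names (Event, endpoints_for, and subscriptions are illustrative, not our actual webhook code):

```python
# Minimal sketch with hypothetical names (not our actual webhook code):
# an event is delivered only to endpoints registered by its owner, so a
# routing bug cannot broadcast it to other clients.
from dataclasses import dataclass

@dataclass
class Event:
    owner_id: str
    payload: dict

def endpoints_for(event: Event, subscriptions: dict) -> list:
    """Return only the webhook URLs registered by the event's owner."""
    # An unknown or empty owner must never fall through to all clients.
    if not event.owner_id:
        return []
    return subscriptions.get(event.owner_id, [])
```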
We are deploying a new feature for the PostgreSQL add-on; creation and management of these add-ons are currently disabled.
A core component keeps restarting, making metrics unavailable for fetching. No metrics are lost during these restarts. We will take action to fix this issue in the coming days.
A mitigation was applied at 02:30 UTC (2018-11-21) which successfully fixed the issue, though only temporarily.
A permanent fix will be applied later today, which will require downtime for that component.
EDIT 2018-11-21 16:50 UTC: The permanent fix is postponed to tomorrow, 2018-11-22.
EDIT 2018-11-22 10:40 UTC: The fix will be applied at 10:50 UTC. It will require at least one restart of that component, making Metrics unavailable for about 20 minutes.
EDIT 2018-11-22 11:25 UTC: Metrics have been back since 11:08 UTC. Incident over.
What appears to be a network issue is affecting several hypervisors and services. We are investigating.
EDIT 19:21 UTC: Here is our provider's incident report: https://status.online.net/incident/153 (3 racks lost public connectivity)
EDIT 20:33 UTC: The issue should be fixed. As of now, our monitoring is happy. We are cleaning up.
Some databases have been reported as slower than usual because of network slowness. We investigated and took action on our reverse proxies; one of them was fully restarted, leading to a loss of established connections. We are currently monitoring whether these actions improve the situation.
EDIT 12:10 UTC: The issue appears to be resolved now.
Metrics currently can't be accessed. Metrics ingestion still works; only metrics fetching is affected.
EDIT 16:25 UTC: One of the components was failing due to a network configuration error. The network configuration has been fixed and the component is currently restarting; it should be back up in about 15 minutes.
EDIT 16:40 UTC: The component has restarted and metrics are available again for reads. No data was lost. Sorry for the extended interruption.
A network issue is currently affecting our reverse proxies in the Montréal zone. We are working on it.
EDIT 13:28 UTC: The network issue has been resolved as of 13:20 UTC. Everything should be back to normal. Sorry for the trouble.
Our API was unavailable for 10 minutes. The CleverCloud console couldn't load and deployments wouldn't start. This has been fixed.
October 2018
A hypervisor is unreachable.
Affected applications are being restarted automatically.
Affected addons are unreachable.
EDIT 17:56 UTC: Looks like it's a network issue, we are awaiting word from our provider.
EDIT 18:08 UTC: Our provider tells us they are working on it; no ETA or details given.
EDIT 18:26 UTC: There was a short electrical outage in the datacenter where this server is located; some routers and switches were impacted by the switchover to the backup power source. They are working on fixing the affected network hardware.
EDIT 18:44 UTC: The server is back; addons should be reachable. We are making sure that everything is back online.
EDIT 18:56 UTC: Everything is working fine. Incident closed.
Some nodes of the cluster crashed and are currently being restarted. Users of this cluster may experience disconnections and failures to read or publish messages.
Update 16:34 UTC: The cluster nodes have been restarted. The cluster is UP again. Sorry for the inconvenience.
A human error caused an issue with the configuration of the add-on reverse proxies at 12:18 UTC. From that point, dedicated MySQL add-ons were unavailable, except over already-established connections.
At 12:30 UTC, we found the cause of the issue.
At 12:32 UTC, the issue was fixed and we regenerated the reverse proxies configuration.
At 12:33 UTC, add-ons were available again.
We have put the necessary protections in place to prevent this from happening in the future.
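One common protection for this class of incident is to validate a regenerated proxy configuration before it can replace the live one. A sketch, assuming an HAProxy-style proxy (the command and paths are illustrative, not necessarily the tooling we use):

```python
# Sketch: regenerate the proxy configuration, validate it, and only then
# swap it in. `haproxy -c -f FILE` exits non-zero on an invalid config.
import shutil
import subprocess

def apply_config(candidate_path: str, live_path: str) -> None:
    check = subprocess.run(["haproxy", "-c", "-f", candidate_path])
    if check.returncode != 0:
        raise RuntimeError("refusing to deploy an invalid configuration")
    shutil.copy(candidate_path, live_path)
    subprocess.run(["systemctl", "reload", "haproxy"], check=True)
```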
Deployment actions (start, restart, stop, git push, ...) are slower than usual. We are looking into this.
13:09 UTC: We are going to restart one of the core deployment systems. Deployment actions (like the ones listed above) will be unavailable for up to 30 minutes. All actions will be queued and executed at the end of the maintenance (sketched below).
13:40 UTC: Another problem occurred during the restart of that system. We are now working to fix it.
EDIT 14:03 UTC: Deployments have been available again for about 5 minutes. We are still cleaning things up before closing this incident.
EDIT 14:30 UTC: Everything should be back to normal now. Sorry for the extra maintenance time and the deployment unavailability.
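The queue-then-drain behaviour mentioned in the 13:09 UTC update can be pictured like this (a hypothetical sketch, not our deployment system):

```python
# Sketch: while maintenance is on, submitted actions are queued instead of
# executed; when maintenance ends, the backlog is drained in FIFO order.
from collections import deque

class ActionGate:
    def __init__(self) -> None:
        self.maintenance = False
        self.backlog = deque()

    def submit(self, action) -> None:
        if self.maintenance:
            self.backlog.append(action)  # executed at end of maintenance
        else:
            action()

    def end_maintenance(self) -> None:
        self.maintenance = False
        while self.backlog:
            self.backlog.popleft()()
```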
Our monitoring system suffered a network cut that made it see many applications as unreachable. Those applications are being redeployed automatically, which may delay new deployment actions (start / redeploy / stop) because of the number of ongoing deployments.
EDIT 12:50 UTC: The deployments should now be back to normal. Apologies for the delays.
September 2018
The GitHub API has changed, so we are patching our API to fix auto-deployments of new applications; the fix will be retroactive.
The follower stopped replicating data and taking backups.
We are trying to restart it.
The PostgreSQL leader is down. We are promoting the follower and updating domains.
DNS has been updated. Clients should now reconnect to the database.
EDIT 12:22 UTC: The new leader has been serving requests correctly since 00:30 UTC.
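For context, the recovery path above is a standard PostgreSQL failover: promote the standby, then repoint the database hostname so clients reconnect to the new leader. A minimal sketch (`pg_ctl promote` is real PostgreSQL tooling; `update_dns` is a hypothetical stand-in for a provider-specific DNS API):

```python
# Sketch of a leader failover. `pg_ctl promote` is standard PostgreSQL
# tooling; update_dns() stands in for a provider-specific DNS API.
import subprocess

def update_dns(hostname: str, ip: str) -> None:
    raise NotImplementedError("provider-specific DNS update")

def failover(data_dir: str, db_host: str, follower_ip: str) -> None:
    # Promote the standby: it stops replaying WAL and starts accepting writes.
    subprocess.run(["pg_ctl", "promote", "-D", data_dir], check=True)
    # Repoint the hostname; clients reconnect once the record propagates
    # (kept quick by a low TTL on the database record).
    update_dns(db_host, follower_ip)
```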
The shared cluster is experiencing performance issues. We are working to mitigate them.
Deployments using a cache (build cache, dependencies cache) are failing because the cache can't be downloaded. We are investigating.
EDIT 10:17 UTC: We are still working on the issue. If you have trouble deploying, you can set your application's scalability settings to the size a dedicated build instance would use. Do not hesitate to ping our support if needed.
EDIT 10:25 UTC: ETA is 2 hours if everything goes well.
EDIT 12:30 UTC: Deployments with cache are back. Everything should work as expected from now on. Sorry for any failed deployments or longer-than-expected deployment times.
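A build pipeline that treats the cache as an optimization rather than a hard dependency degrades more gracefully in this situation. A hypothetical sketch (these helpers are illustrative, not our actual build system):

```python
# Hypothetical sketch: fall back to a clean (slower) build when the cache
# backend is unreachable, instead of failing the deployment outright.
from typing import Optional

def download_cache(app: str) -> Optional[bytes]:
    raise IOError("cache backend unreachable")  # simulates this incident

def build(app: str) -> None:
    try:
        cache = download_cache(app)
    except IOError:
        cache = None  # clean build: slower, but the deployment still runs
    print(f"building {app} with {'warm' if cache else 'cold'} cache")

build("example-app")
```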
One of the two reverse proxies for *.cleverapps.io crashed and took longer than usual to restart. Traffic hitting this server didn't complete: requests hung until the connection timed out.
The problem was resolved at 16:08 UTC.