Incidents
Full history of incidents.
February 2018
Our monitoring system has detected network connectivity issues. They were caused by a network configuration inconsistency and have been resolved.
Node.js applications are failing to deploy because of a missing nomnom module. We are investigating the issue.
EDIT 10:53 UTC: You can set the following environment variable as a temporary workaround: CC_PRE_RUN_HOOK=npm install nomnom@1.8.1 -g
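For reference, a minimal way to apply this workaround with the clever-tools CLI (a sketch assuming clever-tools is installed and linked to the affected application; the incident text only specifies the variable itself):

clever env set CC_PRE_RUN_HOOK "npm install nomnom@1.8.1 -g"   # set the pre-run hook
clever restart                                                 # redeploy so the hook runs before startup

Setting the same variable from the Console's environment variables panel works too; either way, the application must be redeployed for the hook to take effect.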
EDIT 11:33 UTC: A fix has been made and the new image version is now deploying on our servers.
EDIT 12:33 UTC: The new image is now live. All Node.js applications will be redeployed so they no longer use the broken image.
The metrics data cluster is under unusually high load. Metrics display is currently unavailable, but metrics are still being collected.
EDIT 17:35 UTC: Service is back to normal and collected metrics have all been correctly persisted.
The proxy is being restarted. Some add-ons may be unreachable until it's done.
EDIT 15:42 UTC: The incident has been over since 15:40 UTC.
The log storage cluster is experiencing network issues. We are working on it. In the meantime, only realtime logs are available.
January 2018
The proxy is being restarted. Some add-ons may be unreachable until it's done.
EDIT 16:41 UTC: The proxy has been successfully restarted. Add-ons should be reachable again. Applications that do not support the loss of an established connection will be redeployed. We continue to monitor the proxy.
EDIT 17:30 UTC: The incident is now over.
A Redis cluster went down and is restarting.
EDIT 20:17 UTC: The cluster has been restarted and impacted applications have been redeployed. The incident is over.
PostgreSQL add-on dashboards will be unavailable for about 15 minutes, starting on 2018-01-25 at 12:30 UTC.
EDIT: Delayed to 12:50 UTC.
EDIT 12:50 UTC: The maintenance will start in a few seconds.
EDIT 13:07 UTC: Maintenance is over. If you encounter an issue, please tell us.
Logs are currently unavailable. We are working on restoring them. All logs sent in the last 30 minutes won't be stored.
EDIT 03:15 UTC: Logs are available again.
The MongoDB shared cluster needs to be upgraded to have more resources.
Performance issues and/or a partial outage are to be expected. We will try to keep the impact as low as possible.
The maintenance starts at 22:00 UTC.
EDIT 02:00 UTC: The maintenance is now over.
An add-on reverse proxy is restarting. Connections are being dropped, and impacted applications will be redeployed.
EDIT 20:45 UTC: The reverse proxy took ~1 minute to restart. It is now back up.
EDIT 20:48 UTC: Impacted applications were redeployed as expected. The incident is now over and all add-ons are reachable again.
All deployments from around 15:40 UTC might be shown in a FAILED state, even though they were successful. This is only a display issue: instances that deployed correctly are put into production as usual.
The Activity pane (Console), clever status (CLI) and the API endpoint /applications/<app>/deployments incorrectly report the deployment status.
Notifications (Slack webhooks, emails) correctly report the deployment status (failed or successful) and can be trusted; see the CLI cross-check sketch below.
EDIT 21:48 UTC: It should now be fixed. Deployments already displayed in the "FAILED" state will keep that incorrect display.
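For reference, a quick way to cross-check a deployment from the command line during an incident like this one (a sketch assuming the clever-tools CLI referenced above, installed and linked to the application):

clever status     # reported application state (listed above as affected by the display bug)
clever activity   # deployment history (clever-tools command, assumed available; also display-affected here)

As stated above, only notifications (Slack webhooks, emails) were reliable while this incident was ongoing.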
Network instability on Online DC2 makes some products unreachable:
- MySQL shared cluster
- PostgreSQL shared cluster
- MongoDB shared cluster
- One of the cleverapps front proxies
The shared MongoDB cluster is experiencing issues. We are working on bringing it back up.
Due to disk space constraints, we temporarily need to reduce log retention. Only the last 4 days of logs are kept, instead of the usual 7 days.
EDIT 2018-06-15 UTC: All 7 days are now available again.
December 2017
A core component will be upgraded. Deployments will be disabled for an hour starting at 11:30 UTC. This upgrade should fix some deployment delays, among other things.
EDIT 11:31 UTC: Maintenance is starting
EDIT 12:06 UTC: Deployments are back. We are now cleaning up some old artefacts.
EDIT 13:00 UTC: The maintenance is over
Our deployment system is encountering some slowdowns. Some applications may take longer than usual to deploy. We are working on it.
EDIT 19:25 UTC: These slowdowns might require an infrastructure change that will be done next week. Until then, slowdowns should be less frequent and less severe.
EDIT 2017-12-08 12:00 UTC: Deployments take less time after some fixes on our end. The migration will still happen to fully resolve the issue. The incident is considered closed because we no longer observe extra deployment times.
We've observed an elevated error rate on two front load balancers newly added to the pool. We're pulling traffic back from these load balancers.
November 2017
Some deployments might have trouble starting. We are investigating.
EDIT 17:31 UTC: Deployments are disabled for now.
EDIT 17:38 UTC: Deployments are now back up but may be stopped again in a few minutes if needed.
EDIT 17:55 UTC: The incident is now resolved. We will keep an eye on it for the upcoming days.
We are experiencing a network issue on one of our front servers. The support team is actively working on this.
EDIT 14:56 UTC+1: Unreachable servers are being restarted and will be available shortly. In the meantime, impacted applications are being redeployed.
EDIT 15:26 UTC+1: The team is performing the final cleanup. The issue is about to be closed. The remaining apps and add-ons are being restarted.
EDIT 15:50 UTC+1: The outage is now resolved. Contact support if you encounter any trouble.