Incidents
Full history of incidents.
July 2017
We will be performing maintenance on the Cellar cluster starting on 2017-07-20 at 08:00 UTC.
This is a two-step maintenance; the second step will be scheduled at a later stage.
This should not have an impact on availability but may have a light to moderate impact on upload / download speeds.
No ETA as of now, we will be posting updates along the way.
EDIT 2017-07-20 08:00 UTC: Maintenance is starting now
EDIT 10:00 UTC: We are expecting the maintenance to end between 21:00 UTC and 2017-07-21 01:00 UTC; we are seeing no significant impact on upload / download speeds as of now
EDIT 14:45 UTC: The maintenance is running fine and still has no significant impact on performance, so we are keeping it as-is. Consider this event over. If something goes wrong, we will create a new event.
Maintenance of the logs system will happen at 10:00 UTC. Application logs will be unavailable during this maintenance.
The maintenance should not last more than 1 hour.
EDIT 10:18 UTC: Maintenance started a few minutes ago, logs collection will be disabled in a few seconds
EDIT 10:44 UTC: Maintenance ended a few minutes ago; logs are now available
An issue occurred on the main API. It was mostly unavailable, answering only ~30% of requests at best for close to 10 minutes, until we switched to a backup system.
At this point, most services were available except for logs, events and notifications.
30 minutes after the issue began, the API is now fully available.
The network is flaky in the Europe zone; we are seeing intermittent unreachability issues on multiple elements of our infrastructure. We are investigating.
EDIT 06:48 UTC: The network seems to work fine now. Deployments are unavailable, we are working on bringing them back up.
EDIT 07:35 UTC: Deployments have been back up since 07:15, we are still cleaning up the remaining items.
EDIT 07:40 UTC: Everything is cleaned up and functional now. If you have an issue, come ping us.
June 2017
Deployments are disabled for a short maintenance operation.
EDIT 16:12 UTC: Deployments are back
We are currently experiencing performance issues on a component of our deployment system. Deployments are delayed by a few minutes.
We are doing a maintenance operation on a component of our monitoring system. Deployments may be delayed until the end of the operation.
This should last no more than 10 minutes. Deployments should not be delayed by more than a couple of minutes.
Maintenance operation will start at 09:10 UTC.
EDIT 09:19 UTC: Deployments should go back to normal in the next few minutes. Maintenance is over, we are now checking that everything is working fine.
EDIT 09:24 UTC: Deployment delays are back to normal; end of incident
One hypervisor went down; affected applications are being automatically redeployed. Addons on this hypervisor are unreachable (~2% of dedicated addons in the Europe zone).
We are awaiting news from our provider.
EDIT 15:30 UTC: We are still awaiting a manual operation from our provider
EDIT 15:37 UTC: They have rebooted the server manually but "observed an error" and are "analyzing" the issue
EDIT 16:04 UTC: The power supply is out of order and is being replaced
EDIT 16:55 UTC: The operation is over, the server just rebooted and will now start recovering / cleaning up after the forced reboot. Databases will be coming back online automatically.
EDIT 17:50 UTC: Most databases have been available since 17:15 UTC. The remaining databases are now available as well
An incident occurred in our monitoring tools. Old instances are unable to stop, thus causing instability in applications.
Deployments are stopped until the monitoring is back up and running.
We are working on fixing an issue with our application and addon monitoring system in the Europe zone. Deployments have been disabled to allow the monitoring to catch up faster.
The addon gateway has been restarted, some connections have been forcibly closed.
The addon gateway has been restarted, some connections have been forcibly closed.
A core component of the deployment infrastructure will be upgraded to improve stability and performance. As a result, deployments will be stopped for up to 60 minutes (hopefully less).
EDIT 11:05 UTC: Maintenance is fully over now, deployments have been available since 10:50 UTC.
Deployments take more time to start due to higher than usual activity. We are working on fixing the problem.
EDIT 16:00 UTC: The deployment starting time is back to normal
May 2017
Deployments take more time to start due to higher than usual activity. We are working on fixing the problem.
Deployments are disabled following an incident on a component of our deployment system. We are working on bringing it back up.
ETA is about an hour.
Our monitoring system had a small network split that made it think applications were unreachable. This triggered a lot of redeployments. The applications themselves remain reachable. You might receive some emails with a "Monitoring/Unreachable" deployment reason.
Deployments are also delayed until we clear out the non-essential redeployments.
UPDATE 5:07PM UTC: Incident has been resolved, sorry for those redeployments
April 2017
We are investigating the problem.
UPDATE 12:43PM UTC: The problem has been resolved. We will investigate why it happened and how to prevent it from happening again.
We are investigating a network issue affecting a reverse proxy for the addons.
EDIT: The issue is gone. It looks like it was a temporary network issue on our provider's side.
March 2017
Impacted applications are redeploying. Edit: resolved at 14:10 UTC