Incidents
Full history of incidents.
June 2022
Our mail provider is currently experiencing issues. You may notice delays in receiving emails for notifications, password resets, account signup, billing, and other services. You may also see errors such as "Bad request" when clicking links in those emails.
EDIT 13:55 UTC: Our provider indicates that emails are now being delivered, with some delay.
EDIT 16:15 UTC: Email delivery should now be working fine again. Our provider's incident is over.
We are seeing an unusual number of 503 errors on public reverse proxies. We are looking into it.
EDIT 21:28 UTC: The issue has been found and fixed. We are monitoring the situation.
EDIT 21:40 UTC: Everything seems to be back to normal. The issue affected a couple of applications starting around 16:30 UTC. We will investigate further why their configuration was out of sync during that time period.
16:13:00 UTC: A hypervisor has stopped responding. We are investigating why. The system is redeploying the applications that were on it. Some reverse proxies are not responding.
16:24:00 UTC: At first look, it seems a network error is making us see that hypervisor as down. No information yet on whether it's a hardware or software network issue.
16:28:00 UTC: The hypervisor seems to be back up again. We are making sure everything on it is responding well.
16:40:00 UTC: Everything has been checked and is responding correctly.
Impacts:
- Some add-ons became unresponsive.
- Logs were not served.
- One public reverse proxy was unresponsive. Traffic should have been diverted to others. Applications may have been a bit slow.
- Some custom services for customers were unresponsive.
Deployments are currently experiencing various issues. We are investigating.
EDIT 14:55 UTC: The problem has been identified and fixed. Deployments have been working again for the last 10 minutes. Sorry for the inconvenience.
An add-on reverse proxy was unreachable between 14:45 and 14:48 UTC. It has been restarted and is now serving requests as expected. Applications may have failed to reach their add-ons during this time.
Our monitoring shows abnormal CPU usage on some Pulsar brokers. We are investigating.
EDIT: We stopped some components that were increasing the load on the cluster. It should be more stable now.
[Times in UTC] 19:30: We are experiencing network issues in our Paris data center.
19:40: The culprit is a switch that has half stopped responding. It turns out it is not broken enough for its routes to be removed automatically. Our DC contractor is on their way to physically remove the switch. ETA is 30 minutes.
20:00: Cellar seems to be up again. We are still watching and waiting for a direct confirmation from our DC contractor.
00:00: Everything is back to normal.
The unique IP service will undergo a 30-minute maintenance on June 7th starting at 20:00 UTC. During this time period, the service will be unavailable, and applications using it will encounter timeouts or various errors.
Applications will automatically be restarted once the maintenance is over.
EDIT 20:05 UTC: The maintenance is beginning.
EDIT 20:28 UTC: The downtime was reduced to a few minutes but multiple network cuts may have happened. Applications linked to this service are currently redeploying.
A hardware failure occurred on one of our servers (hv-par4-001). Applications are being redeployed on other ones. Add-ons are impacted.
After an abnormal CPU load, one of the MongoDB processes did not restart.
EDIT: Trying to repair the database files.
EDIT: Database filesystem repaired.
EDIT 04/06: The MongoDB process has restarted. Some customers perform expensive queries on the MongoDB cluster, which can cause an OOM kill of the process.
EDIT 06/06 10:31:06 UTC: mongodb-c2 is still experiencing issues, we are working on it.
EDIT 06/06 11:24:00 UTC: Because of a replication recovery bug that MongoDB never fixed in pre-SSPL versions, we are restoring databases from the backups made overnight. Everything should be back up in the afternoon. For faster recovery, users can set up a new dedicated database from those backups.
EDIT 06/06 13:45:00 UTC: The restore process has begun; it will take a few hours. We will keep you posted.
EDIT 06/06 15:01:00 UTC: We have restored half of the customers' databases. We expect full recovery in a few hours.
EDIT 06/06 17:01:00 UTC: An issue occurred while restoring the databases. We are investigating.
EDIT 06/06 23:00:00 UTC: We have restored all the databases that were not above the usage quota. The cluster is now running, and we have improved how we export connection data so applications will behave better when connecting.
Current state:
- DBs have been imported from backups. Backups that were above the free quota were not imported.
- Connection URIs have been updated to include the whole replica set. This will simplify and stabilize how applications connect to the cluster (see the sketch below).
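For illustration, here is a minimal sketch of what such a replica-set-aware connection looks like with pymongo; the hostnames, credentials, database name, and replica set name are all hypothetical:

    # Connect using a URI that lists every replica set member, so the
    # driver can discover the primary and fail over if one node is down.
    from pymongo import MongoClient

    uri = (
        "mongodb://user:secret@"
        "node1.example.com:27017,node2.example.com:27017,node3.example.com:27017"
        "/mydb?replicaSet=rs0"
    )
    client = MongoClient(uri)

    # Bounding query execution time guards against the kind of expensive
    # queries mentioned above, which can OOM the server process.
    for doc in client.mydb.orders.find({"status": "pending"}).max_time_ms(5000):
        print(doc)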
May 2022
We identified one flaky TCP reverse proxy in the Montreal zone. We are investigating.
EDIT 20:37 UTC - fixed.
There are currently some issues with this service: application traffic may not be routed through the proxy and may end up using another IP (the hypervisor's) instead. We are investigating.
EDIT 09:21 UTC: The issue should now be fixed. If it persists, your applications might need to be redeployed. We continue to monitor the service.
EDIT 13:11 UTC: We haven't seen any further issues with the service; the incident is now resolved.
A hypervisor in the Paris zone is currently unreachable due to a connection loss. We are trying to restart it. Impacted applications are being automatically redeployed.
EDIT 22:46 UTC: The hypervisor does not reboot; we are continuing our investigation.
EDIT 00:06 UTC: The hypervisor has been back online for a few minutes. All services are now available again. The cause of the extended downtime has been identified and will be fixed on similar hypervisors for faster recovery next time.
Ingestion of new access logs and metrics points is currently experiencing an issue, leading to missing data points in metrics. Access logs ingestion is on hold; the logs will be processed later. The issue has been identified and we are working on a fix.
EDIT 21:04 UTC: Ingestion is now back to normal. Access logs will be processed over the next few hours.
We lost a server which hosts several components in the PAR zone.
UPDATE: all applications have been redeployed
Some applications are experiencing issues. We are investigating it.
UPDATE 14:57 UTC: Some add-ons are inaccessible due to a faulty proxy. We are removing it from the pool to mitigate.
UPDATE 14:59 UTC: Services are being reloaded to ensure the faulty proxy is removed from the pool.
UPDATE 15:10 UTC: Services are back online for redeployed apps. A faulty sentry induced an abnormal behaviour in the API.
CALL FOR ACTION 15:23 UTC: The remaining applications are currently being redeployed. If you're impacted, we advise you to redeploy your app to accelerate the recovery process.
We currently have issues with deployments. Deployments may fail with errors asking you to contact our support, alongside a stack trace. We are working on a fix.
EDIT 14:59 UTC - We have identified the faulty component: it encounters an issue in its connection pooler.
EDIT 15:09 UTC - The deployments queue is being consumed and catching up. The issue is mitigated.
EDIT 15:23 UTC - The incident is fixed.
Root cause: we found an issue in a messaging driver on a couple of isolated servers. We have removed this specific driver and fallen back on an alternative messaging layer. In the coming days, we will dig into this specific bug and communicate the fix upstream.
The MySQL c5 shared cluster is experiencing issues. We are investigating.
EDIT 20:02 UTC: the MySQL shared cluster is back online.
Logs are currently having some ingestion/query issues. We are working on it.
EDIT 21:39 UTC - querying logs is now available.
The MySQL c6 shared cluster in the EU zone is experiencing issues. We are investigating.
EDIT 21:39 UTC - The shared cluster is now back online.