Incidents

Full history of incidents.

June 2020

Significant delay in Metrics ingestion 5 years ago

Fixed · Access Logs · Global

Metrics ingestion is delayed; we are investigating.

08:37 UTC: Ingestion delay is back to normal. Incident was caused by a few storage nodes misbehaving after a short network issue.

Core API is experiencing issues 5 years ago

Fixed · API · Global

Our Core API is experiencing issues that impact deployments. We are working on it.

EDIT 14:50 UTC: situation is back to normal.

[Paris] PHP sessions and temporary files failed to be written 5 years ago

Fixed · FS Buckets · Global

Due to an excessive number of temporary files created by various PHP sessions in a very short time (+6GB in two minutes), the underlying fs-bucket for PHP sessions became full. You may have noticed various errors, ranging from 'write failed: No space left on device' to more "random" errors caused by the full bucket.

Applications using Redis as a session backend were not impacted by the session issue. They may still have been impacted if they generate temporary files, which are stored on the same fs-bucket.

Our clean-up policy for temporary files was not aggressive enough. We will shorten the clean-up interval to once a day and will continue to monitor whether we need to increase the current disk space.

This incident started at 11:33 UTC+2 and was fully resolved at 11:41 UTC+2.
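As an illustration of the kind of clean-up described above, here is a minimal sketch that deletes temporary files older than a given age. It is not the platform's actual clean-up job: the function name `remove_stale_files`, the `max_age_seconds` parameter and the one-day threshold in the usage note are assumptions made for the example.

```python
import os
import time
from typing import List, Optional


def remove_stale_files(directory: str, max_age_seconds: float,
                       now: Optional[float] = None) -> List[str]:
    """Delete regular files directly inside `directory` whose modification
    time is older than `max_age_seconds`. Returns the removed file names."""
    now = time.time() if now is None else now
    # Snapshot the directory first so we do not delete while iterating.
    with os.scandir(directory) as it:
        entries = list(it)
    removed = []
    for entry in entries:
        # Skip sub-directories and symlinks; only plain files are cleaned up.
        if not entry.is_file(follow_symlinks=False):
            continue
        age = now - entry.stat(follow_symlinks=False).st_mtime
        if age > max_age_seconds:
            os.remove(entry.path)
            removed.append(entry.name)
    return sorted(removed)
```

Called from a scheduled job, for example `remove_stale_files(tmp_dir, 24 * 3600)` where `tmp_dir` is a hypothetical path to the temporary directory, it would drop files untouched for more than a day. A shorter threshold frees space sooner but risks removing files that are still in use.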

Cellar Buckets are slow 5 years ago

Fixed · Cellar · Global

We are investigating issues on Cellar add-ons. We are experiencing network issues.

EDIT 15:20 UTC: fixed.

[Montreal] Reverse proxy configuration fails to automatically update 5 years ago

Fixed · Reverse Proxies · Global

On our MTL zone, reverse proxies are currently not able to update themselves once a configuration change happens (application deployment, added domain, ...). We are looking into it.

19:00 UTC: If your application deploys, the new version will not be served. The application will continue to show the old content, and the old instance will be kept until this incident is over.

19:12 UTC: The issue has been identified, we are fixing it.

19:20 UTC: The issue was caused by the configuration checker, which took much more time than usual before applying each configuration change. A configuration option that disables those checks inside the program handling the configuration has been enabled. The configuration is still checked by the reverse proxy itself, which is much faster.

Deployments should now be up-to-date.

Paris zone: Network outage 5 years ago

Fixed · Infrastructure · Global

It seems a global network outage happened for 1 minute, leading to possible loss of connection to most of our services. It seems to be back for now but we are investigating and we will provide further information.

15:22 UTC: We continue to investigate what has been impacted. Deployments are currently disabled to recover from the event.

15:27 UTC: Deployments are now available.

15:56 UTC: The situation on the platform has stabilized. It seems the outage was between our two datacenters in the Paris zone. We are asking our hosting provider for more details.

16:05 UTC: Our network provider came back to us. The network outage lasted for 1 minute and 20 seconds. One of the links was lost between those two datacenters. The backup link should have been up 2 seconds after the loss of the first link. But for some reason it did not switch (or not correctly). After a 1 minute timeout, all links were closed and reset leading to a new link election which takes ~20 seconds. From there, the connection has been restored. Our network provider will continue to investigate why the initial backup link did not switch.
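The figures reported by the network provider add up as follows. This is only a small arithmetic sketch: the constant and function names are made up for the example, and the ~20 seconds for the link election is the approximate value quoted above.

```python
EXPECTED_FAILOVER_S = 2   # backup link should have taken over after 2 seconds
LINK_TIMEOUT_S = 60       # timeout after which all links were closed and reset
LINK_ELECTION_S = 20      # approximate duration of the new link election


def outage_duration_s(timeout_s: int, election_s: int) -> int:
    """Outage duration when the backup link fails to take over:
    the full timeout elapses, then a new link election runs."""
    return timeout_s + election_s


def format_duration(seconds: int) -> str:
    """Render a duration in seconds as 'M min S s'."""
    minutes, rest = divmod(seconds, 60)
    return f"{minutes} min {rest} s"


observed = outage_duration_s(LINK_TIMEOUT_S, LINK_ELECTION_S)
print(format_duration(observed))        # 1 min 20 s
print(observed - EXPECTED_FAILOVER_S)   # extra seconds compared to a clean failover
```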

Once the network started working again, our monitoring was able to check what was "down". The services that were down were restarted, but this should not have prevented your applications from being reached (they were mostly internal services). Connections from applications to add-ons should have been restored at the same time. If your application crashed because it could not reach its add-on, it should have been automatically redeployed once the deployment system was up again, which should have been shortly before 15:27 UTC.

We are sorry for the inconvenience this outage created. The time of this incident has been changed from 15:06 UTC to 15:04 UTC to correctly match the date and hours.

Metrics unavailable 5 years ago

Fixed · Access Logs · Global

Metrics are currently unavailable.

An index node was restarted to upscale it. Its replica did not like the surge of requests and crashed a few seconds later. We are currently in the process of upscaling all index nodes to avoid such issues; those 2 nodes were the last remaining on the list.

Index nodes have to scan the whole dataset on start, so this will take close to an hour to resolve.

08:07 UTC: Incident is over.

Access logs not available 5 years ago

Fixed · Global

Access logs have been unavailable since 2020-06-04 10:34 UTC. Everything was operational again at 2020-06-04 11:25 UTC.

Metrics unavailability 5 years ago

Fixed · Access Logs · Global

Metrics are currently unavailable as the 2 replicas of a chunk of the index are down. Estimated time to resolution: 30 minutes.

13:38 UTC: Incident is over.

Metrics and access logs ingestion delay 5 years ago

Fixed · Access Logs · Global

Metrics and access logs currently have some delay in their ingestion. We are currently under one hour of delay with the gap closing at a low rate. Data for older metrics / access logs remain available. We are currently working to reduce the current ingestion delay.

EDIT 06:58 UTC: Ingestion is back at its normal rate and we are currently under 30 minutes of delay. This should reach 0 seconds of delay in the next couple of minutes.

EDIT 06:18 UTC: Ingestion delay has also been back to normal for a few minutes. Incident is over. Everything (access logs / metrics) should have the latest data again.

May 2020

Access logs data loss and partial unavailability 5 years ago

Fixed · Global

Due to a backend issue, access logs between 17:05 UTC and 18:19 UTC were lost. All access logs emitted after 18:47 UTC can't be queried because they are not yet indexed. The indexing fails because of an issue with the GEO IP location feature. Once this is fixed, the logs will be indexed. This impacts access logs retrieval through the command line and the various stats displayed in the console (last 24 hours of requests, heatmap, live map, status code, ...).

Update 21:04 UTC: The GEO IP feature has been fixed. It seems to have initially broken with an automatic update of the GEO IP library, but more tests will need to be conducted to be sure of the root cause. All access logs between 18:47 UTC and now have been consumed and you should now be able to query them. We will work on improving the monitoring of the whole system to detect this kind of issue faster.

API & deployments disabled for maintenance 5 years ago

Fixed · Global

In an effort to deal with spikes of API load in recent days, the API and the deployments will be disabled for maintenance starting at 20:30 UTC tonight (22:30 Paris time) for a duration of up to 10 minutes. Thank you for your understanding.

20:30 UTC: Maintenance is starting.

20:35 UTC: API and deployments are back up, maintenance is over.

Deployment failures 5 years ago

Fixed · Deployments · Global

Deployments were failing without any reason given and without any logs in the console / CLI. The root cause was identified and fixed at 10:25 UTC. All deployments started after 09:55 UTC failed and must be restarted.

Sorry for the inconvenience. We are still watching the status of the deployment system to make sure the problem is indeed resolved.

EDIT 10:40 UTC: Everything is back to normal.

One of the add-ons reverse proxies had to be rebooted 5 years ago

Fixed · Reverse Proxies · Global

19:43 UTC: One of the add-ons reverse proxies stopped responding to part of the requests.

19:45 UTC: We restarted it. Some connections that were still working were lost, but the reverse proxy is now operational.

Metrics unavailable 5 years ago

Fixed · Access Logs · Global

Metrics cannot be read at the moment because of an issue with the index components.

A chunk and its replica are both unresponsive, which means the service as a whole is unavailable. We are working on it.

10:00 UTC: An unavailable index node put us on the wrong track. Its replica was actually working just fine; the issue was that both front read nodes were stuck at the same time. We will improve monitoring and try to figure out what went wrong and why.
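The read path described in this incident can be modelled roughly as follows. This is a simplified sketch inferred from the updates above, not the actual Metrics architecture: it assumes reads need at least one working front read node and at least one working replica for every index chunk, and the function and parameter names are hypothetical.

```python
from typing import Dict, List


def metrics_readable(front_read_nodes: List[bool],
                     index_chunks: Dict[str, List[bool]]) -> bool:
    """Return True if Metrics can be read under the assumed model.

    front_read_nodes: health of each front read node (True = working).
    index_chunks: for each index chunk, the health of its replicas.
    """
    # At least one front read node must be able to serve queries.
    front_ok = any(front_read_nodes)
    # Every chunk of the index needs at least one working replica.
    index_ok = all(any(replicas) for replicas in index_chunks.values())
    return front_ok and index_ok


# Initial hypothesis: a chunk and its replica are both down.
print(metrics_readable([True, True], {"chunk-a": [False, False]}))   # False

# What actually happened: one index node down, its replica fine,
# but both front read nodes stuck at the same time.
print(metrics_readable([False, False], {"chunk-a": [False, True]}))  # False
```

Both situations look the same from the outside (reads fail), which is why monitoring each layer separately helps point at the right component.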

April 2020

Metrics ingestion issue 5 years ago

Fixed · Access Logs · Global

A misconfiguration on new Metrics storage nodes caused a bad cluster state, which in turn caused issues with the ingestion. We are working on fixing this issue.

14:35 UTC: We found the cause of the issue and are working on fixing it.

14:47 UTC: The root cause is fixed and the ingestion is now running at full speed. The misconfiguration was only half the story: what caused this issue was a partial network split.

14:56 UTC: Ingestion is all caught up. Incident is over.

Access logs currently unavailable 5 years ago

Fixed · Global

Access logs are currently unavailable. Queries to get access logs might not work as expected. We are working on a fix.

EDIT 13:12 UTC+2: This also impacts the real-time map in the console. You may not see live queries to your applications, but your applications still receive requests as usual.

EDIT 13:22 UTC+2: Fixed; but during the downtime period the access logs were deleted. We identified the root cause and are fixing it.

MongoDB shared cluster: A datanode was unreachable 6 years ago

Fixed · MongoDB shared cluster · Global

At 10:12 UTC, a MongoDB data node of the free and shared cluster went down. It has been restarted at 10:16 UTC. Connections and queries during that time frame may have failed.

Dedicated add-ons (XS SmallSpace and above) were not impacted.

Metrics and access logs ingestion is experiencing issues 6 years ago

Fixed · Access Logs · Global

Metrics and access logs are currently unavailable due to issues. We are working to fix them.

06:38 UTC: Everything is back online, ingestion is catching up.

06:52 UTC: Ingestion delay is back to normal.

FSBuckets write issues 6 years ago

Fixed · FS Buckets · Global

One of our FSBucket systems is experiencing issues on write actions. We have identified the issue and are working to fix it.

EDIT 13:01 UTC: fixed.