Incidents
Full history of incidents.
July 2021
The log systems (including drains) are experiencing issues; we are working on it.
EDIT 9:22 UTC - fixed.
Log ingestion is currently delayed. Drain logs are also impacted. The queue is currently being consumed at a normal rate; everything should be back in order in a few minutes.
EDIT 13:42 UTC: The ingestion stopped again, we continue looking into it.
EDIT 14:05 UTC: We continue to investigate the issue. If you need to access the logs of your application, you can SSH to the VM and display them: https://www.clever-cloud.com/doc/reference/clever-tools/ssh-access/#show-your-applications-logs
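The SSH workaround above can be sketched as follows. This is a minimal, hypothetical sketch: the application ID is a placeholder, and the clever-tools CLI is assumed to be installed and authenticated (see the linked documentation for the authoritative steps).

```shell
# Hypothetical sketch of the workaround above: compose the clever-tools SSH
# command for an application, then run it manually to inspect logs on the VM.
# "app_xxxxxxxx" is a placeholder application ID, not a real one.
APP_ID="app_xxxxxxxx"
SSH_CMD="clever ssh --app $APP_ID"
echo "$SSH_CMD"   # run this command to open a shell on the application's VM
# Once connected, the application's recent output can be inspected directly
# on the instance, independently of the log ingestion pipeline.
```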
EDIT 14:30 UTC: Part of the ingestion queue could not be consumed and has been lost. The rest of the queue is still being consumed, so up-to-date logs are still delayed.
EDIT 17:15 UTC: The queue has been fully consumed and the logs are now up-to-date.
Log queries are currently unavailable, and log drains are not emitting. We are looking into it.
EDIT 12:35 UTC: Logs are back; queries should now work again, and drained logs should have been sent to their endpoints. No logs have been lost.
June 2021
A hypervisor is currently unavailable. Applications hosted on it are currently restarting. Add-ons hosted on that hypervisor are currently unavailable. We are looking into the root cause.
EDIT 14:45 UTC: The server won't reboot as of now; we are not yet sure of the reason and continue to look into it. In the meantime, you can create a new add-on and import last night's backup. Please contact our support team for any further assistance.
EDIT 14:58 UTC: The server still won't reboot, we continue to investigate the reason.
EDIT 15:08 UTC: A ticket has been opened to the manufacturer. The server is still unreachable as of now.
EDIT 15:12 UTC: A server replacement is currently being discussed. In the meantime, we advise you to import last night's backup into a new add-on. If the hypervisor ever comes back, you will be able to access your old add-on and possibly access the data between last night's backup and now, allowing you to merge them if possible. Current ETA is 24 hours.
EDIT 16:38 UTC: No server replacement will happen, we'll have more information to share tomorrow once the manufacturer gets back to us.
EDIT 16:54 UTC: Clarification: No server replacement will happen tonight. There is no sign of disk or data corruption; it appears to be purely a hardware problem, which we can't fix right now.
EDIT 29/06/21 09:30 UTC: Maintenance on the server should happen in the next few minutes. The goal is to replace the problematic hardware component. More information to come.
EDIT 13:17 UTC: The maintenance has been performed and a hardware component was replaced, but it didn't fix the issue. We continue investigating.
EDIT 13:26 UTC: The component initially replaced was the network card. Another replacement, this time of the motherboard, has been planned for tomorrow. We do not yet have the exact time.
EDIT 30/06/21 11:09 UTC: The motherboard has been changed, additional checks are being performed.
EDIT 13:03 UTC: The motherboard replacement did not improve the situation. The server reboots fine without the network card, which has already been changed. A full server replacement is being considered by the manufacturer.
EDIT 18:23 UTC: Our infrastructure provider has been able to provide us with a temporary replacement server which is now up and running. Add-ons and custom services are all up and running. Do note that this is a temporary replacement, once the manufacturer gives us back the fixed server or a fully working permanent replacement, we will have to switch to it (meaning a shutdown of a few minutes). Affected customers will be e-mailed about this.
Some applications may fail to deploy because they try to compile on a runtime instance when a build instance has been configured. Explicitly triggering a rebuild should fix the issue.
2021-06-22
We are currently experiencing connectivity issues or high latency to parts of our Paris infrastructure. Our network provider is aware of the issue and is currently investigating.
10:03 UTC: The issue seems to affect only one of the datacenters. Applications that use services deployed in the other datacenter might suffer from connectivity issues or increased latency.
10:15 UTC: We are removing the IPs of the affected datacenter from all DNS records of load balancers (public, internal and Clever Cloud Premium customers) and are awaiting more info from our network provider.
10:19 UTC: Packet loss and latency have been going down from 10:12 UTC and it seems to be back to normal now. We are awaiting confirmation of the actual resolution of the incident.
10:23 UTC: We are working on resolving issues caused by this network instability and making sure everything works fine.
10:25 UTC: Logs ingestion is fixed. We are working on bringing back Clever Cloud Metrics.
10:31 UTC: IPs removed from DNS records at 10:15 UTC will be added back once we have confirmation that the network issue is definitely fixed.
10:41 UTC: Full loss of connectivity between the two Paris datacenters for a few seconds around 10:39 UTC. We are still experiencing packet loss now. Our network provider is working with the affected peering network on this issue.
10:45 UTC: The two Paris datacenters may be unreachable, depending on your own network provider.
10:49 UTC: Network is overall very flaky. Our network provider and peering network provider are still investigating.
10:57 UTC: According to our network provider, many optical fibers in Paris have been damaged. Some interconnection equipment might be flooded. We are waiting for more information.
11:02 UTC: (Network and infrastructure inside each datacenter are safe. The issue is clearly happening outside the datacenters.)
11:13 UTC: Network is still flaky. Overall very slow. We are still waiting for a status update from our network and peering providers.
11:20 UTC: Network seems better towards one of the datacenters. We invite you to remove all IPs starting with "46.252.181" from your DNS.
11:42 UTC: Still waiting for information from our network providers. Still no ETA.
12:16 UTC: Network loss between the datacenters has decreased a bit. The Console should be more accessible.
12:21 UTC: Connections are starting to come back UP. We are still watching and waiting for more information from our network providers.
12:30 UTC: Info from provider: of the 4 optical fibers, 1 is "fine". They cannot promise it will stay fine. They are still working on it. Teams have been dispatched on the premises.
13:15 UTC: Network is still stable. We are keeping Metrics down for now as it uses a significant amount of bandwidth between datacenters.
13:48 UTC: A second optical fiber is back UP. According to our provider, "it should be fine, now". The other two fibers are still down. The on-site teams are analysing the situation.
13:41 UTC: You can now add back these IPs to your domains:
@ 10800 IN A 46.252.181.103
@ 10800 IN A 46.252.181.104
15:35 UTC: We are bringing Clever Cloud Metrics back up. It's now ingesting accumulated data in the queue while the storage backend was down.
16:45 UTC: Clever Cloud Metrics ingestion delay is back to normal.
17:16 UTC: The situation is currently stable but may deteriorate again. We are closely monitoring it. A postmortem will be published in the following days. If the issue comes back, this incident will be updated again. Sorry for the inconvenience.
17:31 UTC: A 30-second network interruption happened between 17:22:42 and 17:23:10; it was an isolated maintenance event performed by the datacenter's network provider.
2021-06-23
07:01 UTC: This incident has been marked as fixed, as everything has been working fine, as expected, since the second optical fiber link was restored (except for the interruption mentioned in the previous update). Do note that we are not yet back at the normal redundancy level, as the other two optical fiber links are still down. We will update this once we have more information.
10:23 UTC: We have confirmation that a non-redundant third optical fiber link was added at 00:30 UTC; it is only meant to add bandwidth capacity and does not solve the redundancy issue. However, our network provider also tells us that their monitoring shows the redundant link just came back up, although this may be temporary and the link may not be using its usual optical path.
16:13 UTC: The redundant link that came back at 10:23 UTC is stable. It may be re-routed to use another physical path at some point but we can now consider that our inter-datacenter connectivity is indeed redundant again.
From 08:41 UTC to 08:52 UTC, deployments were queued up and very few of them were starting.
This was due to an update that has now been rolled back.
Post Mortem
(The original incident text can be found at the end)
A network issue caused 17 minutes of full unreachability of the Paris zone which in turn caused some applications to go down and our deployment system to slow down while restarting affected applications as well as several other services.
Timeline
10:12 UTC: The whole PAR network is unreachable from outside, cross-datacenter network is down as well.
10:16 UTC: The on-call team is warned by an external monitoring system.
10:21 UTC: Our network provider informs us that they are aware of the issue.
10:29 UTC: The network is back.
10:30 UTC: The monitoring systems are starting to queue a lot of deployments. The load of one monitoring system in charge of one of the PAR datacenters increases significantly. Other systems such as Logs, Metrics, and Access Logs (collection and query) are also impacted and unavailable. Some applications relying on FSBucket services (mostly PHP applications) are also having communication issues with their FSBuckets. This might have made some applications unreachable and their I/O very high, sometimes leading to Monitoring/Scaling deployments. This particular issue was detected later during the incident.
10:35 UTC: Our network provider confirms to us that the issue is fixed.
10:50 UTC: Deployments are slow to start because many of them are in queue.
11:00 UTC: The load of the faulty monitoring system being too high causes it to see more applications down than there actually are, and to queue even more deployments for applications that were actually reachable.
11:15 UTC: Clever Cloud Metrics is back, delayed data points have been ingested. Writing to the ingestion queue is still subject to problems.
11:20 UTC: We notice the build cache management system is overloaded, slowing down deployments and failing those that rely on the build cache feature. The retrying of these failed deployments adds even more items to the deployment queue.
11:28 UTC: We start upscaling the build cache management system beyond its original maximum setting.
11:52 UTC: We believe an issue found in the past few days within the build cache management system is responsible for the slowness/unreachability of the build cache service. This issue caused a thread leak which had been triggering more upscalings than usual. A fix was being tested on our testing environment but was not yet validated. We try to push this fix to production.
12:48 UTC: The fix pushed to production at 11:52 UTC is not effective. We upscale the build cache management system again.
13:00 UTC: Logs collection is back. Logs collected before this time were lost. Queries are also available but might still fail sometimes or return delayed logs.
13:05 UTC: We prevent the overloaded monitoring system from queuing up more deployments and empty out its internal alerting queue.
13:10 UTC: We roll back a change made to the database a few days ago, which we believe is the root cause of the ongoing issue.
13:16 UTC: The build cache management system database load starts to go up. This is caused by the application being more effective at making requests to the database thanks to the previous rollback.
13:18 UTC: The build cache management system database is overloaded.
13:33 UTC: We start looking into optimizing requests and clearing up stale data.
13:59 UTC: We manage to bring the build cache management system database load down.
14:05 UTC: The build cache management system is still overloaded/slow despite its database now working properly. A deployment is queued with an environment config change but is slow to start. We restart the application manually to apply this change.
14:10 UTC: The change of configuration is effective, the deployment queue starts to empty itself but there are still a lot of deployments in the queue.
14:15 UTC: An older deployment that had been waiting to be processed, performed without the environment change, finishes successfully, causing about half of the build cache requests to fail.
14:17 UTC: We start reapplying the fix manually on live instances while a new deployment with the correct environment is started. The deployment queue size is going down.
14:29 UTC: The deployment queue is filling up again.
14:53 UTC: We realize the faulty monitoring system is still queuing deployments despite its alerting queue being empty and the alerting action being disabled.
14:57 UTC: We completely restart the faulty monitoring system and make sure it stops queuing deployments.
15:10 UTC: We are now certain the previously faulty monitoring system stopped queuing deployments for false positives. The deployment queue is back to normal and the deployment system is more reactive.
15:15 UTC: We start cleaning stuck deployments and making sure everything is working fine.
15:42 UTC: We start redeploying all Paris PHP applications which have not been deployed since the network came back.
16:00 UTC: Some PHP deployments seem to be failing due to a connection timeout to their PHP session stored on an FSBucket. We abort the PHP deployment queue to avoid any more errors.
16:10 UTC: The connection was only broken on one hypervisor and is now fixed. We also make sure every other hypervisor can contact all FSBucket servers on the PAR zone.
16:15 UTC: The PHP deployments queue is started again, with a lower delay between deployments.
16:42 UTC: Clever Cloud Metrics / Access logs ingestion is now fixed. Queries should be returning up-to-date data. Access logs were stored in a different queue and have been entirely consumed.
17:05 UTC: The PHP deployments queue is now completed. All other applications in the PAR zone, which had not been redeployed since the network came back, have also been queued for redeployment to fix any connection issue to their FSBucket add-ons.
19:10 UTC: A few applications which have the “deployment with downtime” option enabled were supposed to be UP but had no running instances. Those applications are now being redeployed.
Network incident details
Foreword: Clever Cloud has servers in two datacenters in the Paris zone (PAR). In this post-mortem, they are named PAR4 and PAR5.
A routine maintenance operation by our Network Provider on PAR4 started a few minutes before the incident. The maintenance involved decommissioning a router and was not expected to impact the network. Various checks and monitoring were in place, as usual, and a quick rollback procedure was planned in case anything went wrong.
The decommissioning triggered an unexpected election of another router, which in turn triggered a lot of LSA (link-state advertisement) updates between all the routers of the datacenter, sometimes doubling them. Those updates created new LSA rules on other routers, which first made them slower to update and to route traffic. Some routers then hit a configured limit on the number of LSA rules. Upon hitting the limit, a router went into protection mode and shut itself down. This shutdown triggered further LSA updates on other routers, which then also hit their LSA limit and entered protection mode. This isolated the PAR4 site from the network.
A piece of internal equipment with a link between PAR4 and PAR5 also propagated those LSA updates to the PAR5 routers, replicating the exact same scenario.
To fix this, our Network Provider disconnected some routers, lowering the number of LSA announcements across the network and bringing the routers back online.
Actions
Network provider
Actions taken
- The equipment that had links between the two datacenters has been isolated and is now in its own network. This ensures LSA updates aren't inadvertently sent to the second datacenter.
- An isolation timeout has been lowered from 5 minutes to 1 minute, making the system react faster to failures.
Actions planned in a few days
- Forbid any non-primary router from being elected as a leader, to avoid any issue. Under their support contract, our network provider has officially sent a bug report to the manufacturer of the router that did not behave as expected and is awaiting a fix and any relevant information.
- Routers will now reject LSA rules when they hit their limit instead of going into protection mode. This results in a degraded network at first rather than a fully broken one. There are currently 4 different brands of routers, and each of them will be tested separately.
- Other security measures have been taken. Additional monitoring and logs will also be added.
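The planned change to LSA-limit handling can be illustrated with a small sketch. This is a simplified, hypothetical model (real routers implement this inside their OSPF stack); it only shows why rejecting excess LSAs leaves a degraded network where protection-mode shutdown leaves a broken one.

```python
# Toy model of the two LSA-limit behaviours. LSA_LIMIT and the functions
# below are illustrative, not actual router configuration.
LSA_LIMIT = 3

def shutdown_on_limit(lsa_updates):
    """Old behaviour: hitting the limit triggers protection mode,
    shutting the router down and losing every route."""
    table = []
    for lsa in lsa_updates:
        if len(table) >= LSA_LIMIT:
            return []  # router shuts itself down: all routes are lost
        table.append(lsa)
    return table

def reject_on_limit(lsa_updates):
    """Planned behaviour: excess LSAs are rejected, existing routes survive."""
    table = []
    for lsa in lsa_updates:
        if len(table) >= LSA_LIMIT:
            continue  # degraded: new route ignored, but the router stays up
        table.append(lsa)
    return table

updates = ["lsa1", "lsa2", "lsa3", "lsa4", "lsa5"]
assert shutdown_on_limit(updates) == []                      # broken network
assert reject_on_limit(updates) == ["lsa1", "lsa2", "lsa3"]  # degraded network
```

With shutdowns, each router that dies generates more LSA updates for its neighbours, which is the cascade described in the post-mortem; rejection breaks that feedback loop.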
Clever Cloud
Actions taken
- Improved the performance of the build cache management system's database interactions, as well as the performance of the database itself.
- Fixed a deployment system bug with urgent queues, allowing us to deploy some applications before others (internal and Clever Cloud Premium customers).
Actions planned
- Further improve performance and resilience of the build cache management system.
- Improve the monitoring of the alerts queue and of the number of unreachable deployments being processed
- Improve the visibility of urgent alerts among a high number of alerts
- Improve the monitoring of the logs storage system
- Improve the monitoring of the connectivity between FS buckets servers and hypervisors
- Improve the monitoring of applications that should be up without having any instances
- Improve our communication on our status page to post updates more frequently
Original incident details
We are currently experiencing a network accessibility issue on our PAR zone. We are investigating.
EDIT 12:21 UTC+2: Our network provider is looking into the issue.
EDIT 12:28 UTC+2: Deployments on other zones might not work correctly, but traffic shouldn't be impacted.
EDIT 12:30 UTC+2: Network connectivity seems to be back. We are awaiting confirmation of incident resolution from our network provider.
EDIT 12:35 UTC+2: Our network provider found the issue and fixed it. Network is back online since 12:30 UTC+2. Investigation will be conducted to understand why the secondary link hasn't been used.
EDIT 12:42 UTC+2: A postmortem will be made available later once everything has been figured out.
EDIT 12:50 UTC+2: The deployment queue is currently processing; queued deployments might take a few minutes to start.
EDIT 13:00 UTC+2: Logs may also be unavailable, depending on the application.
EDIT 13:20 UTC+2: The deployment queue still has a lot of items; the build cache feature is currently having trouble, which slows down deployments.
EDIT 14:33 UTC+2: The deployment queue is now smaller, but some deployments are still having issues. Logs are also partially available.
EDIT 15:30 UTC+2: The build cache feature is still having trouble; we are currently working on a workaround. Logs should now be back, but there is a processing delay which might affect availability in the Console / CLI. They might be a few minutes late.
EDIT 16:04 UTC+2: Some applications linked to FSBucket systems might have lost their connection to the FSBucket, increasing their I/O and possibly rebooting in a loop for either Monitoring/Unreachable or Monitoring/Scalability. This can cause response timeouts, especially for PHP applications.
EDIT 16:16 UTC+2: The build cache should be fixed, meaning that deployments should take less time.
EDIT 16:53 UTC+2: A lot of Monitoring/Unreachable events are still being sent, causing many applications to redeploy for no good reason. We are still working on it.
EDIT 17:18 UTC+2: The issue with Monitoring/Unreachable events has been fixed. The size of the deployments queue should go down.
EDIT 18:07 UTC+2: Most issues have been cleared up. PHP applications may still be experiencing issues; we are working on it. If you are experiencing issues on non-PHP applications, please contact us.
EDIT 19:05 UTC+2: All PHP applications have been redeployed. If you are still experiencing issues, please contact us. All other applications which have not already been redeployed since the beginning of the incident will be redeployed in the next few hours (to make sure no apps are stuck in a weird state).
We are experiencing issues with public reverse proxies.
EDIT 16:58 UTC: we mitigated the issues.
Reverse proxies on the Paris zone are experiencing instabilities. We are investigating.
EDIT 18:04 UTC+2: One of the reverse proxies stopped accepting new connections. It has been removed from the pool for further investigation. Stability should have been restored as of two minutes ago.
EDIT 18:18 UTC+2: Performance is back to normal. We are going to investigate further why this reverse proxy entered this state without being noticed.
Planned maintenance of the storage backend of Clever Cloud Metrics (used for access logs as well) will occur on 2021-06-15 at 11:30 UTC.
The maintenance itself should take no more than an hour. During this time, writes will be queued and reads will be partially available.
Once the maintenance is over, queued-up writes will start being ingested, reads will be available again (except for recent data until queued-up data points are ingested).
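The queue-during-maintenance pattern described above can be sketched as follows. This is a minimal illustration under assumed names (`MetricsStore`, `end_maintenance` are hypothetical), not Clever Cloud's actual implementation.

```python
# Sketch: while the storage backend is under maintenance, writes are buffered
# in a queue instead of being lost; once maintenance ends, the queue is
# drained into storage and reads become fully up to date again.
from collections import deque

class MetricsStore:
    def __init__(self):
        self.storage = []        # stands in for the storage backend
        self.queue = deque()     # write queue used during maintenance
        self.maintenance = False

    def write(self, point):
        if self.maintenance:
            self.queue.append(point)   # queued, not lost
        else:
            self.storage.append(point)

    def end_maintenance(self):
        self.maintenance = False
        while self.queue:              # ingest queued-up data points in order
            self.storage.append(self.queue.popleft())

store = MetricsStore()
store.write("p1")
store.maintenance = True
store.write("p2")                      # buffered; reads won't see it yet
assert store.storage == ["p1"]         # partial reads during maintenance
store.end_maintenance()
assert store.storage == ["p1", "p2"]   # queue drained after maintenance
```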
11:36 UTC: Maintenance is starting.
12:04 UTC: Maintenance is over. The ingestion pipeline is running at full speed catching up on the queued-up data.
12:18 UTC: Ingestion is caught up.
The shared MongoDB cluster on the Paris zone is overloaded. We are investigating what is most likely excessive resource usage by some users.
As a reminder, this cluster is only used by free plans labeled "DEV". This is meant to be used for development and testing purposes only, not production.
If you are using a free plan in production, we suggest you migrate to a dedicated plan using the migration tool in the Clever Cloud console.
10:43 UTC: The cluster is working fine now, although it may be slower than usual for a while, as a node is out of the cluster and will be re-added later.
12:23 UTC: The node mentioned in the last update has been re-added. The incident is over.
Logs are deactivated while we are investigating an issue.
EDIT 19:14 UTC: Logs should now be back to normal. Sorry for the interruption.
Dedicated load balancers for Clever Cloud's own applications (APIs, Console, website, ...) are overloaded.
We are in the process of adding capacity to resolve this issue.
14:28 UTC: Performance is back to normal.
At 11:30 UTC we started getting tickets about customers' applications not responding. We started investigating; it looks like the network or the reverse proxies are responsible.
EDIT 12:46 UTC: we are experiencing abnormal new connection rates on public reverse proxies.
EDIT 12:50 UTC: we found the application responsible for this new connection rate and are mitigating it.
EDIT 14:19 UTC: Load balancers have been upscaled so they can handle more traffic. Performance is back to normal since 13:12 UTC.
Logs ingestion is malfunctioning. We are investigating.
08:00 UTC: New logs are being ingested. Logs emitted during the incident will not be ingested into the main logs storage system. Log drains may start receiving (part of) the older logs; we are still investigating this part.
08:15 UTC: Looks like everything that could be ingested has been ingested. Ingestion delay may still be a little higher than normal though, it should go back to normal soon.
Some instabilities were detected on the Warsaw reverse proxies, leading to some connections being unexpectedly dropped. The problem was fixed after an upgrade of said reverse proxies.
May 2021
Our cellar-c1 cluster is experiencing connection issues. Some buckets might be unavailable as the network between various nodes of the cluster is currently having issues. We are investigating.
Cellar-c2 cluster isn't impacted.
EDIT 08:23 UTC: Connectivity seems to be back; we have notified both network providers used for Cellar-c1 and are awaiting an answer. We are waiting a bit longer to see whether the links are properly back or another issue should be expected.
EDIT 08:47 UTC: The connection is now down again.
EDIT 09:35 UTC: The connection has been back up for 15 minutes and the root cause may have been found. We are waiting for explanations from our network provider. In the meantime, this issue may also have affected applications that are connecting to external services. We've seen loss to Scaleway and Azure, there might have been more.
EDIT 10:25 UTC: The issue now seems to be resolved. The root cause wasn't entirely identified; current investigations show that a transit provider had an issue and traffic was redirected elsewhere, possibly leading to saturation of some links (which would explain why the loss wasn't 100%, but more like 80%).
Deployments seem to be unresponsive at the moment; we are investigating.
EDIT: The issue has been fixed.
Warp 10 read operations are unavailable. We are working on it. Service should be back in ~2 hours.
Data is still being ingested.
09:45 UTC: Incident is over.