Incident History
Full history of incidents.
April 2022
Live updates:
Some hypervisors are experiencing issues with qemu. VMs are randomly crashing.
We are investigating.
- 0323: It looks like too many processes are being started and systemd is killing qemu threads.
- 0330: We suspect a recent update to be causing the thread exhaustion on the HVs.
- 0345: We start applying a patch to revert the update.
- 0407: We have finished checking everything. The HVs look fine now.
Post Mortem:
Incident summary
On the 4th of April, the CCOS (Clever Cloud Operating System) orchestrator was unable to complete some new deployments.
A few days earlier, we had introduced a new notification subsystem, which was required to enable the Network Groups feature. This subsystem caused hypervisor agents to initiate new connections to the messaging component.
An issue in the proxy layer, which did not properly close connections, led to connections stacking up until the pooler was saturated. As a result, agents accumulated too many processes on the hypervisor machines for too long, preventing new processes from being spawned.
Our hypervisor controller was unable to spawn new threads, which prevented new deployments from completing. Running virtual machines were also unable to spawn new threads, which crashed some of them.
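To illustrate the failure mode described above, here is a minimal toy model in Python. It is not the actual proxy or pooler code: the `Pool` class, its `capacity` and `idle_timeout` parameters, and the agent names are all hypothetical. It only shows how connections that are never closed saturate a bounded pool, and how reclaiming idle connections frees capacity again.

```python
class Pool:
    """Hypothetical bounded pooler: at most `capacity` open connections."""

    def __init__(self, capacity, idle_timeout):
        self.capacity = capacity
        self.idle_timeout = idle_timeout
        self.last_seen = {}  # connection id -> time of last activity

    def open(self, conn_id, now):
        # Refuse new connections once the pool is full.
        if len(self.last_seen) >= self.capacity:
            raise RuntimeError("pool saturated")
        self.last_seen[conn_id] = now

    def close(self, conn_id):
        self.last_seen.pop(conn_id, None)

    def reap_idle(self, now):
        """Drop connections idle for longer than idle_timeout; return the count."""
        stale = [c for c, t in self.last_seen.items()
                 if now - t > self.idle_timeout]
        for c in stale:
            del self.last_seen[c]
        return len(stale)


pool = Pool(capacity=3, idle_timeout=30)

# Three connections are opened and never closed: this simulates the leak.
for i in range(3):
    pool.open(f"agent-{i}", now=0)

# A fourth agent cannot connect because the pool is saturated.
try:
    pool.open("agent-3", now=1)
    saturated = False
except RuntimeError:
    saturated = True

# Reclaiming idle connections frees the pool for new agents.
reaped = pool.reap_idle(now=60)
pool.open("agent-3", now=60)
```

In this model the clock is passed in explicitly (`now`) so the behaviour is deterministic. A real proxy would rely on its own timers and connection state.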
Short term resolution
Since Network Groups are in ALPHA, we immediately decided to roll back their availability, reverting to a non-blocking version that does not rely on our messaging layer.
Long term resolution
Two different actions are being rolled out.
- The first is a patch, currently being tested on a dedicated deployment, to ensure that connections are garbage-collected on the messaging service proxy layer.
- The second targets the hypervisor agent with an architectural change to prevent too many processes from being spawned. A dedicated driver has been set up as a service to maintain a single connection and a single process, instead of spawning an on-demand process for each notification. This change should avoid any issue related to the messaging service, even for problems other than connection handling.
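The second action can be pictured with a short Python sketch. It is an illustration under assumptions, not the agent's real code: `NotificationDriver`, `notify` and `close` are hypothetical names, and a single long-lived thread stands in for the single service process. The point is that every notification is queued and handled by the same worker, so the number of workers stays constant regardless of the notification rate.

```python
import queue
import threading


class NotificationDriver:
    """Hypothetical driver: one long-lived worker drains a queue of notifications."""

    _STOP = object()  # sentinel used to shut the worker down

    def __init__(self, handler):
        self._handler = handler
        self._queue = queue.Queue()
        self.worker_names = set()  # records which threads handled messages
        self._thread = threading.Thread(
            target=self._run, name="notification-driver", daemon=True
        )
        self._thread.start()

    def notify(self, message):
        # Enqueue instead of spawning a new worker per notification.
        self._queue.put(message)

    def _run(self):
        while True:
            message = self._queue.get()
            if message is self._STOP:
                break
            self.worker_names.add(threading.current_thread().name)
            self._handler(message)

    def close(self):
        # The sentinel is queued after all pending messages, so they are
        # all handled before the worker exits.
        self._queue.put(self._STOP)
        self._thread.join()


handled = []
driver = NotificationDriver(handled.append)
for i in range(1000):
    driver.notify(f"event-{i}")
driver.close()
```

With this shape, a slow or unavailable messaging service makes the queue grow instead of the process count, which is the property the architectural change is aiming for.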
March 2022
Cellar C2 is experiencing read and write issues. The team is investigating.
EDIT 15:32 UTC: The team has found the origin. We are working on a fix.
EDIT 15:50 UTC: Reading is back, the situation is being mitigated.
EDIT 16:01 UTC: Cellar C2 is up and running.
We identified issues on our metrics and access logs storage: certain metrics and access logs are not accessible.
The problem has been identified and we are working on a fix.
EDIT 15:36 UTC: certain metrics and access logs are still not accessible.
EDIT 18:50 UTC: metrics and access logs are now accessible.
Some parts of our infrastructure are slowing down the deployments.
Our private reverse proxies (which serve our APIs) are encountering performance issues. This is slowing down API requests and parts of the deployment process.
We are trying to fix these performance issues.
We have identified issues affecting logs and drains.
EDIT 18:43 UTC: fixed.
Due to security issues in the biscuit-auth token v1, the Pulsar add-on cluster will be restarted with the new biscuit authentication/authorization plugins (biscuit v2.0), which introduce breaking changes. The related add-ons will have their environment variables updated accordingly, so linked applications will be redeployed automatically.
Everything went well. Do not hesitate to reach us via support if you have any questions.
Due to an upcoming maintenance operation, we have disabled Pulsar add-on creation.
EDIT 20:57 UTC: creation is enabled.
We identified issues on our metrics and access logs storage. We are working to fix the problem, which is currently causing some difficulties on the query side.
Users are experiencing HTTP errors on the heptapod.host website.
UPDATE 2022-03-24 15:40 UTC: the website is no longer returning HTTP errors.
As announced, cellar-c1 has been permanently shut down.
If you are missing files that were stored on it, please contact support with all the relevant information: add-on ID, bucket name, etc.
This maintenance concerns the migration of our cellar-c1 Cellar cluster. Affected customers have been emailed multiple times since January regarding this service's end of life.
As a reminder, the service will be shut down on 21/03/22. A few network brownouts will be applied to remind customers that they need to migrate their data.
A total of 5 brownouts will be applied. During these planned downtimes, the service will refuse any connections, be it HTTP or HTTPS.
This brownout will happen on 16/03/22 16:00 UTC for a 30-minute window.
Our support team remains at your disposal for any questions.
We are currently having various networking issues (packet loss or slow response times) on our Paris infrastructure. We are investigating.
Some services are also impacted:
- Pulsar
- Metrics
- Access logs
EDIT 18:20 UTC: Our network provider is investigating the issue.
EDIT 18:28 UTC: The issue has been identified and has been escalated. Logs may also be impacted.
EDIT 18:44 UTC: The issue is still being worked on, but Pulsar and Logs are now working fine again.
EDIT 19:26 UTC: The issue has been fixed by the network provider at 18:54 UTC. All components are now working fine again. Access logs are being ingested and may have some lag for a few hours. Sorry for the inconvenience.
We identified issues on our metrics/access logs storage. We are working to fix the problem, which is currently causing some lag in the ingress data plane.
EDIT 12:04 UTC: The lag in the ingestion pipeline has been resolved.
This maintenance concerns the migration of our cellar-c1 Cellar cluster. Affected customers have been emailed multiple times since January regarding this service's end of life.
As a reminder, the service will be shut down on 21/03/22. A few network brownouts will be applied to remind customers that they need to migrate their data.
A total of 5 brownouts will be applied. During these planned downtimes, the service will refuse any connections, be it HTTP or HTTPS.
This brownout will happen on 18/03/22 10:00 UTC for a 30-minute window.
Our support team remains at your disposal for any questions.
EDIT 11:00 UTC: The brownout has started and will last for 30 minutes.
EDIT 11:30 UTC: The brownout has ended. The service will be decommissioned next Monday.
This maintenance concerns the migration of our cellar-c1 Cellar cluster. Affected customers have been emailed multiple times since January regarding this service's end of life.
As a reminder, the service will be shut down on 21/03/22. A few network brownouts will be applied to remind customers that they need to migrate their data.
A total of 5 brownouts will be applied. During these planned downtimes, the service will refuse any connections, be it HTTP or HTTPS.
This brownout will happen on 14/03/22 09:30 UTC for a 30-minute window.
Our support team remains at your disposal for any questions.
EDIT 09:36 UTC: The brownout is starting. It will last for 30 minutes.
EDIT 10:07 UTC: The brownout has ended. The next one will happen on 16/03/22 16:00 UTC for a 30-minute window.
This maintenance concerns the migration of our cellar-c1 Cellar cluster. Affected customers have been emailed multiple times since January regarding this service's end of life.
As a reminder, the service will be shut down on 21/03/22. A few network brownouts will be applied to remind customers that they need to migrate their data.
A total of 5 brownouts will be applied. During these planned downtimes, the service will refuse any connections, be it HTTP or HTTPS.
This brownout will happen on 11/03/22 14:00 UTC for a 10-minute window.
Our support team remains at your disposal for any questions.
EDIT 14:00 UTC: The brownout is starting and will last for 10 minutes.
EDIT 14:10 UTC: The brownout has ended. The next one will happen on 14/03/22 09:30 UTC for a 30-minute window.
This maintenance concerns the migration of our cellar-c1 Cellar cluster. Affected customers have been emailed multiple times since January regarding this service's end of life.
As a reminder, the service will be shut down on 21/03/22. A few network brownouts will be applied to remind customers that they need to migrate their data.
A total of 5 brownouts will be applied. During these planned downtimes, the service will refuse any connections, be it HTTP or HTTPS.
This brownout will happen on 09/03/22 10:00 UTC for a 10-minute window.
Our support team remains at your disposal for any questions.
EDIT 10:00 UTC: The brownout has started.
EDIT 10:10 UTC: The brownout has ended. The next one will happen on 11/03/22 14:00 UTC.
Our Cellar C1 cluster is currently experiencing connectivity issues leading to failed requests. We are investigating the cause with our network provider.
EDIT: The connectivity issues have been resolved by our network provider. The service should now run as expected.
We identified issues on our metrics/access logs storage. We are working to fix the problem, which is currently causing timeouts on queries.
EDIT 10:27 UTC: Queries have returned to normal, Metrics and Access logs should now be reachable. We are monitoring the queries.
EDIT 11:03 UTC: Queries have returned to normal, Metrics and Access logs should now be reachable.