Clever Cloud Status

Incidents

Full history of incidents.

May 2022

Fixed · Access Logs · Global

Metrics and access logs are currently experiencing ingestion and query issues. We are working on it.

EDIT 12:11 UTC: Metrics and logs are now accessible again. Sorry for the inconvenience.

Fixed · Access Logs · Global

Metrics and access logs are currently experiencing ingestion and query issues. We are working on it.

EDIT 10:56 UTC: Metrics and logs are now accessible again. Sorry for the inconvenience.

April 2022

Fixed · Services Logs · Global

Live logs are unavailable. EDIT: An internal service was unreachable; the Live Logs system is now fully operational.

Fixed · MongoDB shared cluster · Global

A MongoDB node was unreachable. This node is now fixed.

Fixed · Access Logs · Global

Metrics and access logs are currently experiencing ingestion and query issues. We are working on it.

EDIT: Ingestion is fixed; queries are almost fully restored.

Fixed · Services Logs · Global

Log pipeline components lost their connection. EDIT: Connection issue fixed; we are consuming the logs queue lag. We lost almost 30 minutes of logs. EDIT: Lag consumed.

Fixed · Access Logs · Global

We are facing an issue with the indexes which results in some metrics and access logs being unavailable at the egress level. EDIT: All indexes have been rebooted.

Fixed · Access Logs · Global

Metrics and access logs are currently experiencing an ingestion issue. We are working on it.

EDIT 13:20 UTC: The issue has been fixed. Some metrics data points have been lost. Access logs are being queued for ingestion again.

Fixed · API · Global

Some API calls might return a 504 error. The root cause has been found and we are working to restore the service.

EDIT 17:20 UTC: The service has been fully restored. Sorry for the inconvenience.

Fixed · Services Logs · Global

We have identified issues affecting logs and drains. We are working on it.

EDIT 06:45 UTC: fixed.

EDIT 07:22 UTC: we have identified another issue.

EDIT 09:45 UTC: fixed.

Fixed · API · Global

We are currently experiencing instabilities with our main API. We are looking into it.

EDIT 15:12 UTC: This seems to be back to normal. We have not found the root cause yet but we keep looking. Some actions may have failed, such as deployments, git pushes, or accessing the dashboard / using the CLI in general.

EDIT 17:34 UTC: We still see some instabilities, resulting in various longer queries or even errors from some services that fail to contact our API. We are still working on identifying the root cause.

EDIT 20:34 UTC: We haven't seen any more instabilities since the latest status update. We'll continue to monitor activity over the next couple of days.

Fixed · Access Logs · Global

Access logs currently have a few hours of ingestion delay. It is being resolved and the delay should be back to normal in a few hours. This impacts the retrieval of access logs using the CLI or the API. The various console dashboards (status codes, requests per hour, ...) are also impacted and might display out-of-sync data. Sorry for the inconvenience.

EDIT 20:43 UTC: The delay is now resolved; you should be able to query the access logs using the CLI or API again.

Fixed · Services Logs · Global

We have identified issues affecting logs and drains.

EDIT 15:05 UTC: fixed.

Fixed · Infrastructure · Global

SYD zone is unreachable. We are investigating.

EDIT 18:23 UTC - the SYD zone (provided by OVH) seems to be reachable only from within the OVH network

EDIT 18:30 UTC - we are waiting for our provider's feedback

EDIT 19:00 UTC - fixed: https://network.status-ovhcloud.com/incidents/j5vzf90dpzcc

Fixed · Access Logs · Global

Metrics and access logs are currently delayed. Data points are queued and will be processed as soon as possible. This may lead to some series missing recent data.

EDIT 10:27 UTC: The delay is now resolved. Sorry for the inconvenience.

Fixed · Reverse Proxies · Global

An add-on reverse proxy on the PAR zone was unreachable for 15 minutes. The restart initially failed, hence the extended downtime.

This should now be resolved. The 7 other reverse proxies were working as usual.

Fixed · RabbitMQ shared cluster · Global

Deployments are broken. We are investigating why.

0740: The cause has been found and fixed.

Fixed · Access Logs · Global

We identified issues with our metrics and access logs storage: certain metrics and access logs are not accessible.

The team has found the origin. We are working on a fix.

Fixed · Infrastructure · Global

Live updates:

Some hypervisors are experiencing issues with qemu. VMs are randomly crashing.

We are investigating.

  • 0323: It looks like too many processes are being started and systemd is killing qemu threads (see the diagnostic sketch after this list).
  • 0330: We suspect a recent update is causing the thread exhaustion on the HVs.
  • 0345: We start applying a patch to revert the update.
  • 0407: We have finished checking everything. The HVs look fine now.
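
As a hedged aside (not from the incident log itself): on Linux hosts running cgroup v2, systemd caps the number of tasks (processes plus threads) in a unit's cgroup through the pids controller, and hitting that cap makes new thread creation fail, which matches the symptom above. A minimal Python sketch for reading that pressure; the cgroup name in the usage example is an assumption, not taken from the incident:

    # Hypothetical diagnostic for task (process/thread) exhaustion under
    # cgroup v2. Reads the standard pids controller files.
    from pathlib import Path

    def task_pressure(cgroup: str) -> str:
        base = Path("/sys/fs/cgroup") / cgroup
        current = int((base / "pids.current").read_text())
        limit = (base / "pids.max").read_text().strip()
        if limit == "max":
            return f"{cgroup}: {current} tasks, no limit"
        return f"{cgroup}: {current}/{limit} tasks ({current / int(limit):.0%} of the cap)"

    # Example (machine.slice is a common, assumed location for qemu VMs):
    # print(task_pressure("machine.slice"))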

Post Mortem:

Incident summary

On April 4th, some new deployments could not be completed by the CCOS (Clever Cloud Operating System) orchestrator.

A few days ago, we introduced a new notification subsystem, required to enable the Network Groups feature. This new subsystem caused hypervisor agents to initiate new connections to the messaging component.

An issue in the proxy layer, which did not properly close connections, led to connections stacking up until the pooler was saturated. As a result, agents accumulated too many processes on hypervisor machines for too long, preventing new processes from being spawned.

Our hypervisor controller was unable to spawn new threads, which meant new deployments could not be completed. It also prevented the running virtual machines from spawning new threads, crashing some of them.
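
To illustrate the failure mode, here is a minimal, self-contained Python sketch. The Pooler class, its capacity, and the notification handler are illustrative stand-ins, not Clever Cloud's actual components: each notification takes one connection slot, the proxy never releases it, and the pool eventually saturates.

    # Toy model of the saturation described above; names and numbers
    # are illustrative only.

    class Pooler:
        """Stand-in for the messaging service's connection pooler."""
        def __init__(self, capacity: int) -> None:
            self.capacity = capacity
            self.in_use = 0

        def acquire(self) -> None:
            if self.in_use >= self.capacity:
                raise RuntimeError("pooler saturated: no free connection slots")
            self.in_use += 1

        def release(self) -> None:
            self.in_use -= 1

    pooler = Pooler(capacity=100)

    def handle_notification() -> None:
        # One short-lived process per notification, as in the old design.
        pooler.acquire()
        # ... deliver the notification ...
        # BUG: the proxy layer is expected to close the connection
        # afterwards but never does, so release() is never called
        # and the slot leaks.

    for n in range(150):
        try:
            handle_notification()  # each notification leaks one pooler slot
        except RuntimeError as exc:
            print(f"notification {n} failed: {exc}")  # pool is full
            break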

Short term resolution

Network Groups being in ALPHA, we immediately decided to roll back their availability, pushing a non-blocking version which did not rely on our messaging layer.

Long term resolution

Two different actions are being rolled out.

  • The first is a patch, currently being tested on a dedicated deployment, that ensures connections are garbage-collected on the messaging service's proxy layer.
  • The second targets the hypervisor's agent with an architectural change to prevent too many processes from being spawned: a specific driver has been set up as a service that maintains a single connection and a single process, instead of spawning an on-demand process for each notification (a sketch of this pattern follows this list). This change should avoid any issue with the messaging service, even for problems other than connection handling.
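
A minimal sketch of that single-connection driver pattern, with purely illustrative names (the queue, host, and port are assumptions, not the actual agent's interface):

    # Sketch of the fixed design: one long-lived service process holding a
    # single connection and consuming notifications from a queue, instead
    # of spawning a process (and a fresh connection) per notification.
    import queue
    import socket

    def run_driver(host: str, port: int, notifications: "queue.Queue") -> None:
        conn = socket.create_connection((host, port))  # the one shared connection
        try:
            while True:
                message = notifications.get()
                if message is None:    # shutdown sentinel
                    break
                conn.sendall(message)  # every notification reuses the same socket
        finally:
            conn.close()               # exactly one connection to clean up

However many notifications arrive, the messaging layer only ever sees one connection from each hypervisor agent, so a misbehaving proxy can no longer stack connections until the pooler saturates.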

March 2022

Fixed · Cellar · Global

Cellar C2 is experiencing read and write issues. The team is investigating.

EDIT 15:32 UTC: The team has found the origin. We are working on a fix.

EDIT 15:50 UTC: Reads are back; the situation is being mitigated.

EDIT 16:01 UTC: Cellar C2 is up and running.