Clever Cloud Status

Incidents

Full history of incidents.


January 2023

Hypervisor reboots in Paris
Fixed · Infrastructure · Global

Two hypervisors have rebooted in the Paris zone. Deployments have been impacted and some applications and databases may be unreachable. We are investigating the issues.

EDIT 13:59 UTC: One hypervisor is up and running.

EDIT 14:52 UTC: The second hypervisor is down due to hardware issues.

EDIT 15:22 UTC: Applications and databases may be difficult to reach, as a load balancer node is hosted on the downed hypervisor.

EDIT 17:00 UTC: Deployments may have been impacted; we are redeploying the system.

EDIT 17:30 UTC: The hypervisor is up and running. We are cleaning up the last remaining items.

EDIT 18:17 UTC: Both hypervisors are up and running. All systems seem to be working normally.

Cellar read-only
Fixed · Cellar · Global

At 02:20 UTC, we started receiving alerts indicating that the Ceph pools are full. We are investigating this.

At 04:40 UTC, we decided to lower the replication ratio to let the cluster breathe.

Many backups failed, though. We will run them again during the day.
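
For context on the mitigation above: a replicated Ceph pool stores `size` copies of every object, so lowering that number immediately frees raw capacity on a full cluster, at the cost of one fewer copy of the data. A minimal sketch of the kind of admin command involved, assuming a Ceph admin host and a hypothetical pool name:

```python
import subprocess

# Hypothetical pool name; real pools are listed with `ceph osd lspools`.
POOL = "cellar-objects"

def set_pool_replicas(pool: str, replicas: int) -> None:
    """Lower a Ceph pool's replica count to free raw capacity.

    `ceph osd pool set <pool> size <n>` is the standard Ceph command;
    going from 3 replicas to 2 reclaims a third of the pool's raw usage,
    trading away one copy of the data while the cluster recovers.
    """
    subprocess.run(
        ["ceph", "osd", "pool", "set", pool, "size", str(replicas)],
        check=True,
    )

if __name__ == "__main__":
    set_pool_replicas(POOL, 2)
```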

Network connectivity issues in Paris
Fixed · Infrastructure · Global

Several applications deployed in the Paris region are not reachable. We are on it.

EDIT 21:56 UTC: We are experiencing a network connectivity issue impacting parts of the Paris region. Cellar is also impacted.

EDIT 22:01 UTC: Network connectivity is back online. Apps should be reachable. Cellar is in recovery; we are working on it.

EDIT 22:25 UTC: Cellar should be accessible. You may experience a bit more latency due to recovery processes in progress.

EDIT 22:57 UTC: Everything should be up.

Hypervisor down in Paris
Fixed · Infrastructure · Global

A hypervisor in Paris is down/unreachable. We are investigating. You may experience some deployment issues.

EDIT 15:25 UTC: The hypervisor is back online. All impacted applications have been redeployed. If you are experiencing an issue, please contact our support.

Metrics and access logs unavailable
Fixed · Access Logs · Global

The Metrics and access logs stack is currently unavailable. The problem has been identified and we are working to bring it back up.

Metrics queries through the console or Grafana, as well as access log queries, are currently affected.

EDIT 16:44 UTC: The service is back up, we are starting to process the backlog of events. You should now be able to query the data but it might lag a bit.

EDIT 17:01 UTC: The queue has been ingested. The service is now back to normal. Sorry for the inconvenience.

Slow deployments
Fixed · Deployments · Global

Deployments are currently slower than usual. They may take more time to start or complete. We are investigating.

EDIT 13:46 UTC: The slowness has been resolved as of 13:35 UTC. The initial cause has been found and we continue to monitor the situation.

Deployment failures
Fixed · Deployments · Global

Deployments are currently unavailable and failing for unknown reasons. We are investigating.

EDIT 15:25 UTC: Deployments are running again. Some more operations will be done in the next few minutes to stabilize the situation. In the meantime, we continue to monitor the health of the deployment system.

EDIT 15:45 UTC: The incident is now over. If you still have troubles deploying your application, please reach out to our support team. Sorry for the inconvenience.

Cellar clock synchronization errors
Fixed · Cellar · Global

Cellar C2 is having issues with time synchronization. This may result in a "ClockTooSkewed" error when you try to list or access files.

We are working on fixing the clocks on the Ceph monitor servers. (Ceph is the software we use to provide the Cellar service.)

EDIT 12:40 UTC+1: One of the reverse proxies in front of the Cellar system was desynchronized. This proxy is now out of the pool for further investigation and the issue should now be fixed.
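
On the client side, this class of error is retryable: once clocks are back in sync it goes away, and when only one node behind a load balancer is skewed, as was the case here, a retry may simply land on a healthy proxy. A minimal sketch, assuming the boto3 S3 client (Cellar exposes an S3-compatible API); the endpoint URL and bucket name are placeholders:

```python
import time

import boto3
from botocore.exceptions import ClientError

# Placeholder endpoint for an S3-compatible Cellar client.
s3 = boto3.client("s3", endpoint_url="https://cellar.example.com")

def list_objects_with_retry(bucket: str, attempts: int = 3):
    """List a bucket, retrying on clock-skew errors.

    A skewed server or client clock surfaces as a "...TooSkewed" error
    code; when the skew sits on a single node behind a load balancer,
    a retry after a short backoff has a good chance of succeeding.
    """
    for attempt in range(attempts):
        try:
            return s3.list_objects_v2(Bucket=bucket)
        except ClientError as err:
            code = err.response["Error"]["Code"]
            if "TooSkewed" not in code or attempt == attempts - 1:
                raise
            time.sleep(2 ** attempt)  # exponential backoff: 1s, 2s, ...
```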

Network instabilities in Paris
Fixed · Infrastructure · Global

We are currently experiencing network instabilities in the Paris zone. Our network provider is aware of the issue and we are awaiting more information. One instability was detected at 9:42 UTC+1. No other has been detected since then.

Our Metrics and access logs stack is currently unavailable; we are working on bringing it back up.

Update 09:55 UTC: Metrics and access log storage is now up. We are catching up on the lag.

Update 14:33 UTC: The lag on the Metrics and access logs platform is now resolved. Regarding the network instabilities, our network provider has identified the issue and is working on resolving it. It may take a few hours to get back to a nominal situation. We have not seen any other instabilities since this morning.

Update 15:59 UTC: Another network issue happened at 15:50 UTC and lasted for about one minute; parts of the Paris zone were unreachable during that time.

Update 23:11 UTC: No other incident has been seen; we are still waiting for our network provider to confirm that the issue is resolved on their end.

Update 2023-01-09 14:18 UTC: We've seen two new events, one at 13:23 UTC and another at 14:14 UTC. We notified our network provider. These may be related to the same problems we saw last week.

Update 2023-01-09 19:47 UTC: Those two events were not linked to the ones seen last week. Their cause has been identified by the network provider and has been fixed. We are still waiting for confirmation that the original issue is resolved.

Authentication failures in the API
Fixed · API · Global

Part of the authentication process in the API is failing. We have identified the error and are working to fix it.

December 2022

Git repositories maintenance
Fixed · Global

The Git repository servers on all of our zones will undergo maintenance today at 18:00 UTC+1. This maintenance will have the following impacts:

  • Delayed git repository creation for newly created applications

  • Delayed addition or removal of SSH keys authorized to interact with the git repositories

GitHub applications will not be impacted.

During the maintenance, you will still be able to push your updates and run deployments. The maintenance is expected to last up to one hour. If you have any questions, please reach out to our support team.

EDIT 18:01 UTC+1: The maintenance is starting.

EDIT 18:35 UTC+1: The maintenance is now over. Thanks for your patience.

Urgent API maintenance
Fixed · API · Global

Due to an issue with our core APIs, we are doing urgent maintenance. It should take 15 minutes. Deployments will be blocked during this time. Applications will keep running.

EDIT 13:45 UTC: Done.

Dedicated load balancers unreachable
Fixed · Infrastructure · Global

Following a maintenance, a few servers hosting dedicated load balancers are seen as unreachable by our monitoring.

EDIT 2022-12-20 19:47 UTC: During the recovery process, some services went down with TLS issues.

Hypervisor not responding in Paris
Fixed · Infrastructure · Global

A hypervisor in Paris is not responding. We are investigating.

Update 04:16 UTC: The hypervisor is now up. We are running the associated cleanup tasks.

Update 04:54 UTC: Cleanup is over.

Network interruptions in a Paris datacenter
Fixed · Infrastructure · Global

One of our Paris datacenters is encountering network issues. Some servers were unreachable for about one minute at 10:09 UTC. Services (applications, add-ons, Cellar, ...) hosted on those servers were partially or fully unreachable during that time (depending on the scalability or replication of those services).

The cause has been identified and a solution is currently being investigated. This incident will be updated as soon as we have more information.

EDIT 13:44 UTC: Another network interruption happened at 13:01 UTC. A fix is currently being tested.

EDIT 14 Dec 2022 15:55 UTC: The fix appears to be working as expected. This incident is now over.

Spurious Monitoring/Unreachable redeployments
Fixed · Infrastructure · Global

Some applications are having trouble being monitored by our systems and get redeployed with the Monitoring/Unreachable reason even though they are still available. We are investigating the cause of the issue.

EDIT 15:55 UTC: The cause has been found. This issue only affects applications tied to a unique IP proxy service. The issue has been mitigated in the last few minutes and we are working on a full fix.

EDIT 16:20 UTC: The issue has been fixed and should not happen again. If you encounter weird Monitoring/Unreachable deployments, feel free to contact our support team.

Deployment queue broker failure
Fixed · Deployments · Global

Our distributed deployment message queuing system lost a broker, which led to lag in message consumption and left a few components unable to reconnect properly to the cluster.

EDIT 18:30 UTC: All systems are up.

November 2022

High read-write latency on Cellar
Fixed · Cellar · Global

We are experiencing high read-write latency on Cellar. We are working on it.

EDIT 24 November, 09:33 UTC: Rebalancing is over.

Deployment lag
Fixed · Deployments · Global

Some deployments are lagging.

A monitoring desynchronization is causing disturbances in deployments. We are investigating the trouble and manually cleaning up unnecessary deployments.

We still have to clean up some stuck deployments, but the system has now recovered.

Pulsar storage full
Fixed · Pulsar · Global

The storage layer no longer accepts any writes. We are scaling up the Pulsar storage capacity.
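
When a Pulsar cluster's storage layer (BookKeeper) stops accepting writes, producers block until their send timeout expires. A minimal sketch of a client that fails fast in that situation, assuming the `pulsar-client` Python library; the service URL and topic are placeholders:

```python
import pulsar

# Placeholder service URL and topic.
client = pulsar.Client("pulsar://localhost:6650")

# A bounded send timeout turns a full storage layer into a visible error
# instead of an indefinitely blocked producer.
producer = client.create_producer(
    "persistent://tenant/namespace/events",
    send_timeout_millis=10_000,
)

try:
    producer.send(b"payload")
except Exception as err:  # pulsar-client raises on timeout or broker errors
    print(f"write failed, storage may be full: {err}")
finally:
    client.close()
```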