Clever Cloud Status

Incidents

Full history of incidents.

May 2021

Fixed · Infrastructure · Global

Multiple hypervisors in the Paris zone are unreachable. We are investigating.

14:52 UTC: Network issue is resolved. We are assessing the damage.

15:07 UTC: API and deployments are down. We are cleaning everything and bringing it up.

15:20 UTC: API is back. Deployments are back but are currently experiencing significant delays.

15:42 UTC: We are still working on this. Deployments are quicker now but not yet back to normal.

16:02 UTC: This incident is over. If you are still experiencing issues, please contact us.

Post-mortem

A maintenance operation carried out by our network provider a few hours before this incident generated a faulty BGP announcement. Because of this, a significant portion of the traffic leaving our Paris infrastructure was routed through a NYC peer, causing significant delays and even timeouts.

Routers in one of our Paris datacenters were heavily impacted by this issue and failed to accept configuration fixes. After multiple attempts to fix them, our provider ended up power-cycling the affected routers, which cut most of our hypervisors in this datacenter off from the rest of the network for 3 minutes.

Corrective actions will be taken to prevent this from happening again (BGP filters, and a dedicated admin network for the routers, which was already scheduled to be set up in a few days). We will also make sure that we are warned in due time if a significant network configuration or hardware issue occurs.

Fixed · Global

Following https://www.postgresql.org/about/news/postgresql-133-127-1112-1017-and-9622-released-2210/, our PostgreSQL shared clusters will be upgraded to the latest minor version of their branch.

Affected clusters are:

  • postgresql-c4: Paris zone
  • postgresql-c5: Montreal zone

This update may affect the performance and availability of the databases.

The upgrade will start in a few minutes. This maintenance will be updated accordingly.

EDIT 18:28 UTC+2: Montreal cluster is now up-to-date

EDIT 19:54 UTC+2: Paris cluster is now up-to-date, but the postgis extension is currently broken due to the update. We are working on a fix.

EDIT 20:27 UTC+2: Paris cluster: databases are currently being migrated to a newer version of postgis. It will take a few hours to run on all of the databases

EDIT 20:42 UTC+2: This maintenance is now considered over.
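For reference, the per-database postgis bump mentioned in the 20:27 update amounts to something like the following sketch (Python with psycopg2). The cluster host, credentials and database names are placeholders, and this is not the exact tooling we used:

```python
# Hypothetical sketch: bump the postgis extension on every database of a cluster.
# Connection parameters and the database list are illustrative only.
import psycopg2

DATABASES = ["db_customer_a", "db_customer_b"]  # placeholder names

for dbname in DATABASES:
    conn = psycopg2.connect(host="postgresql-c4.example", dbname=dbname,
                            user="admin", password="secret")
    conn.autocommit = True
    with conn.cursor() as cur:
        cur.execute("SELECT installed_version, default_version "
                    "FROM pg_available_extensions WHERE name = 'postgis'")
        row = cur.fetchone()
        if row and row[0] and row[0] != row[1]:
            # Upgrade the extension objects to match the newly installed library.
            cur.execute("ALTER EXTENSION postgis UPDATE")
    conn.close()
```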

Fixed · Global

A hypervisor needs to be rebooted. Impacted customers will shortly receive an email, and add-ons that can be migrated will be migrated before the reboot. Estimated downtime is about 15 minutes.

Add-ons will start being migrated at 20:30 UTC+2. The hypervisor will be rebooted at 21:30 UTC+2.

EDIT 20:36 UTC+2: Maintenance is starting. Applications are getting redeployed and add-ons are starting their migrations

EDIT 21:30 UTC+2: Add-ons that could be migrated have been migrated and applications have been redeployed. The server will now reboot.

EDIT 22:00 UTC+2: The server has finished rebooting; add-ons that weren't migrated have been reachable again since 21:45 UTC+2. The maintenance is over.

Fixed · Infrastructure · Global

A hypervisor became unresponsive in the PAR zone. It's currently rebooting.

Affected applications are being automatically redeployed. Affected addons are unreachable.

21:53 UTC: The hypervisor is back online and is starting addon VMs.

21:55 UTC: All addons are back online. The incident is over.

Fixed · Access Logs · Global

Metrics/AccessLogs queues are being consumed. Recent data values are currently unavailable.

06:30 UTC: Incident is over.

Fixed · API · Global

Core services (console, API, metrics, access logs) are experiencing issues. We identified the problem and are working to resolve it.

EDIT 23:02 UTC: the incident is related to one of our hypervisors.

EDIT 23:03 UTC: we restarted the hypervisor; related databases are down.

EDIT 23:04 UTC: hypervisor is up; VMs are starting.

EDIT 23:13 UTC: metrics are down too.

EDIT 23:25 UTC: databases are up. We are now experiencing issues with our internal reverse proxies, and the console and API are not available.

EDIT 23:30 UTC: we queued the linked applications for a high-priority redeploy to ensure they reconnect to their databases. Core services are still partially down.

EDIT 0:00 UTC: all applications are redeployed.

EDIT 02:56 UTC: we are still working to fix issues on our internal core services (console, API); users' applications/add-ons are not impacted.

EDIT 03:30 UTC: internal core services are back!

Fixed · Access Logs · Global

(All times in UTC) At 22:50 we got an alert saying that access logs had stopped being consumed. At 22:53 we got alerts saying that hbase region servers had gone down.

After investigation, we found that the hadoop namenodes were all in standby. At 23:33, after various checks, we promoted one back to active. We then restarted all the hbase regionservers and waited for the cluster to balance and heal.

At 00:04 we restarted the warp10 stores. At 00:07 everything was back to normal.
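As an illustration of the kind of check that catches an "all namenodes in standby" situation early, here is a minimal monitoring sketch in Python. The hosts, port and JMX attribute names are assumptions based on a stock Hadoop HA setup, not our actual tooling:

```python
# Hypothetical monitoring sketch: poll each namenode's JMX endpoint and alert
# when none of them reports itself as "active".
import json
from urllib.request import urlopen

NAMENODES = ["namenode-1.internal:9870", "namenode-2.internal:9870"]  # examples

def ha_state(hostport: str) -> str:
    # The FSNamesystem bean exposes the HA state as "tag.HAState" in stock Hadoop.
    url = f"http://{hostport}/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem"
    try:
        with urlopen(url, timeout=5) as resp:
            beans = json.load(resp).get("beans", [])
    except OSError:
        return "unreachable"
    return beans[0].get("tag.HAState", "unknown") if beans else "unknown"

states = {nn: ha_state(nn) for nn in NAMENODES}
if "active" not in states.values():
    print(f"ALERT: no active namenode! states={states}")
```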

April 2021

Fixed · Reverse Proxies · Global

An add-on reverse proxy was restarted because of a very high load. Applications connected to that proxy may have lost connections to their add-ons. An upgrade of that proxy had been planned for the coming weeks to avoid any chance of high load; other proxies had already been upgraded. The upgrade will now be done in the next couple of days, as the proxy has been taken out of the pool.

Fixed · API · Global

We are currently experiencing issues with the API following an update; we are rolling back to fix the issue.

16:13 - Rollback was successfully executed and everything is back to normal.

Fixed · Cellar · Global

We are investigating an issue with the Paris Cellar cluster.

17:33 UTC: The issue has been resolved. It was due to a partial upgrade (in progress) of the cluster. Upgraded nodes have been downgraded.

18:08 UTC: The upgrade was in progress to fix the security issue labelled CVE-2021-20288. Due to the large number of machines, some of them were not yet up to date, which led to the issue we were facing: machines that were not yet patched were unable to authenticate correctly, leading to a cascading failure across several of them. Another strategy will be used to continue the upgrade of the cluster.

Fixed · Deployments · Global

We currently have some issues with deployments. We are investigating.

Edit 22:48 UTC: Deployments have been working fine since 22:30; we just made sure that everything was okay. Deployments that were stuck have been restarted, and those that failed can now be restarted without any issue. Sorry for any inconvenience.

Logs Drains
Fixed · Services Logs · Global

Logs drains are temporarily unavailable.

EDIT 14:37 UTC - fixed.

March 2021

Fixed · Reverse Proxies · Global

Around 50% of TLS connections made to one of the HTTP/2 reverse proxies were dropped, indicating a missing certificate. The issue originated from a misconfiguration of this reverse proxy. Additional checks have been put in place to prevent this from happening again.

The error started at 12:23:36 UTC and stopped at 12:46:50 UTC, lasting around 23 minutes.
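As an illustration, a check of that kind can be as simple as the following Python sketch, which attempts a TLS handshake against the proxy and reports whether a certificate was served. The host name is a placeholder and this is not our actual monitoring code:

```python
# Hypothetical sketch: verify that a TLS handshake completes and that the
# server presents a certificate, without validating the chain here.
import socket
import ssl

def tls_handshake_ok(host: str, port: int = 443) -> bool:
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE  # we only check that a certificate is served
    try:
        with socket.create_connection((host, port), timeout=5) as sock:
            with ctx.wrap_socket(sock, server_hostname=host) as tls:
                return tls.getpeercert(binary_form=True) is not None
    except (OSError, ssl.SSLError):
        return False

if __name__ == "__main__":
    ok = tls_handshake_ok("example-proxy.invalid")  # placeholder host
    print("certificate presented" if ok else "NO certificate presented")
```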

Fixed · Access Logs · Global

We are experiencing issues with our metrics/accesslogs storage cluster.

EDIT 13:21 UTC - fixed.

Fixed · Access Logs · Global

Metrics & AccessLogs querying components are temporarily unavailable.

EDIT 17:03 UTC - fixed.

Fixed · Global

Our Liar Proxy hosted on OVH has been unavailable since 01:23 UTC+1. The incident on OVH's side is http://travaux.ovh.net/?do=details&id=49473& but the Strasbourg (SBG) zone seems to be having a more general issue: http://travaux.ovh.net/?do=details&id=49471

We'll update this incident in the morning. If OVH fixes the issue before then, the liar proxy should recover network access.

EDIT 11:06 UTC+1: This service is in SBG1, which is currently impacted by the fire that took place in SBG. It may take several days to come back online, depending on whether it is possible to order new servers at OVH. If you are a user of this service, please contact us through support if you have any questions.

Fixed · Reverse Proxies · Global

On 2021-03-07 at 19:40 UTC, websites on the RBX zone went down. We started investigating the issue at 19:45 and saw that the RBX reverse proxies were not accepting new connections. We restarted them and everything went back to normal by 19:54.

The culprit was a badly configured NOFILE limit on the RBX reverse proxies. We updated the setting accordingly.

Afterwards: We investigated all the reverse proxies on all the zones to make sure the NOFILE limit was correctly configured everywhere. We updated the reverse proxy software (sozu) to refuse to start when given too low a NOFILE limit, and we updated the sozu package to enforce the right NOFILE value upon installation.
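For illustration, a minimal sketch of that kind of startup guard is shown below in Python (sozu itself is not written in Python, and the threshold is an example value, not the one we enforce):

```python
# Hypothetical sketch: refuse to start when the NOFILE soft limit is too low,
# after trying to raise it up to the hard limit.
import resource
import sys

REQUIRED_NOFILE = 65536  # illustrative threshold, not sozu's actual value

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
if soft < REQUIRED_NOFILE:
    try:
        # Raise the soft limit if the hard limit allows it.
        resource.setrlimit(resource.RLIMIT_NOFILE, (REQUIRED_NOFILE, hard))
        soft, _ = resource.getrlimit(resource.RLIMIT_NOFILE)
    except (ValueError, OSError):
        pass
if soft < REQUIRED_NOFILE:
    sys.exit(f"refusing to start: NOFILE soft limit is {soft}, "
             f"need at least {REQUIRED_NOFILE}")
```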

Fixed · Access Logs · Global

We experienced an unexpected issue with a core component of the Metrics system.

The service is completely unavailable at the moment. We are working on it.

08:50 UTC: The faulty component is working again. We are working on bringing everything back up.

08:59 UTC: Everything is back up. The ingestion pipeline is catching up.

09:07 UTC: The incident is over.

Fixed · API · Global

We are investigating issues with our core API.

EDIT 21:07 - fixed.

Fixed · Global

Some FS-Bucket add-ons will need to be migrated to a different server for security reasons. During this migration, the buckets will be in read-only mode: any attempt to create or update a file on the add-on will fail, including FTP operations, and "Read-only file system" errors are expected.

The migration is expected to last at most 1 hour. All impacted applications will be redeployed during the migration. After the redeployment, applications will be able to write to the bucket again. Read operations will not be impacted.
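If your application writes to its bucket through the file system, the sketch below shows one way to tolerate such errors during the window. It is a minimal Python example, with a placeholder path, not a prescribed pattern:

```python
# Hypothetical sketch: treat "Read-only file system" (EROFS) as a temporary
# condition during the maintenance window instead of crashing.
import errno
import logging

BUCKET_PATH = "/app/bucket/uploads/report.txt"  # example path

def write_report(data: str) -> bool:
    try:
        with open(BUCKET_PATH, "w") as f:
            f.write(data)
        return True
    except OSError as exc:
        if exc.errno == errno.EROFS:
            logging.warning("bucket is read-only (maintenance?), will retry later")
            return False
        raise

write_report("hello")
```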

EDIT: This maintenance has been postponed to 15:00 UTC+1

EDIT 15:00 UTC+1: The maintenance is starting

EDIT 15:02 UTC+1: The buckets are now read-only

EDIT 15:14 UTC+1: Starting now, you can redeploy your applications if you want to regain write access early. Otherwise, affected applications will be redeployed automatically in the upcoming hour, starting with applications of Clever Cloud Premium customers

EDIT 17:14 UTC+1: The deployment queue finished one hour ago and everything has been working fine so far. This maintenance is over.