Incidents
Full history of incidents.
February 2022
We are currently having connectivity issues toward Scaleway infrastructure from one of our datacenters in Paris. We are investigating this issue with our network providers.
EDIT 23:15 UTC: Connectivity has been back since 23:07 UTC; our network provider says the issue has been resolved. This incident is now closed on our end. Sorry for the inconvenience.
Our Cellar C1 cluster service is currently unreachable from our Paris infrastructure, leading to failed requests. This is the old cluster, served under the cellar-c1.clvrcld.net and cellar.services.clever-cloud.com domains. We are investigating the issue.
EDIT 22:42 UTC: After a quick investigation, only one of the 3 IPs serving those domains is having trouble reaching other nodes of the cluster. That IP has been dropped from the DNS. Meanwhile, we are investigating the issue with our network provider.
EDIT 22:39 UTC: Lowering the severity to Performance Issues. A ticket has been opened with our network provider.
EDIT 23:15 UTC: Connectivity has been back since 23:07 UTC; our network provider says the issue has been resolved. We will wait a bit before adding the faulty node's IP back to the DNS, just to be sure, but this incident is now closed on our end. Sorry for the inconvenience.
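For reference, pulling an IP out of the DNS rotation like this can be verified from the client side. A minimal Python sketch, assuming nothing beyond the two domains named above:

```python
# Check which IPs a Cellar domain currently resolves to, e.g. to confirm
# that the faulty node's address was dropped from the DNS rotation.
import socket

def resolve_ipv4(hostname):
    """Return the set of IPv4 addresses currently advertised for a hostname."""
    infos = socket.getaddrinfo(hostname, 443, socket.AF_INET, socket.SOCK_STREAM)
    return {info[4][0] for info in infos}

for domain in ("cellar-c1.clvrcld.net", "cellar.services.clever-cloud.com"):
    print(domain, "->", sorted(resolve_ipv4(domain)))
```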
Generation of certificates for newly added domains is currently delayed due to a rate-limit issue. A fix has been issued on our end and the situation should return to normal in a few hours.
EDIT 19:16 UTC+1: This does not impact renewal of certificates.
EDIT 19:36 UTC+1: We are now under the rate limit, newly added domains should have their certificates generated in a few minutes, as usual. Sorry for the inconvenience.
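As an aside, the usual way to live with a certificate authority's rate limit is to retry issuance with a backoff. A minimal Python sketch, where `issue_certificate` and `RateLimited` are hypothetical stand-ins for a real CA client, not our actual pipeline:

```python
# Illustration of rate-limit-aware certificate issuance: retry with an
# exponential backoff until the CA lets the request through.
import time

class RateLimited(Exception):
    """Hypothetical error raised when the CA rejects us for quota reasons."""

def issue_certificate(domain):
    """Placeholder for the real CA client call (e.g. an ACME order)."""
    raise NotImplementedError

def issue_with_backoff(domain, max_attempts=5):
    delay = 60.0  # start by waiting one minute
    for _ in range(max_attempts):
        try:
            issue_certificate(domain)
            return
        except RateLimited:
            time.sleep(delay)  # wait until we are back under the limit
            delay *= 2
    raise RuntimeError("still rate-limited after %d attempts" % max_attempts)
```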
We need to perform a maintenance operation on the free PostgreSQL shared cluster of the Paris zone. A fail-over will be initiated and applications may have trouble connecting to the new leader. Make sure to restart them if needed.
The fail-over will take place within the next hour.
EDIT 15:17 UTC: The cluster will fail over in the next few minutes. Some queries might fail as soon as the leader goes down, until your application correctly connects to the new leader.
EDIT 15:28 UTC: The fail-over is done. Make sure to restart your applications if they can't connect to their add-on.
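If your application cannot survive a fail-over like this one on its own, a reconnect loop is usually enough to avoid manual restarts. A minimal Python sketch, assuming psycopg2 and the usual POSTGRESQL_ADDON_* environment variables:

```python
# Reconnect with retries so the application rides out the window between
# the old leader going down and the new leader accepting connections.
import os
import time
import psycopg2

def connect_with_retry(retries=10, delay=3.0):
    """Try to connect to the add-on, retrying while the new leader comes up."""
    for _ in range(retries):
        try:
            return psycopg2.connect(
                host=os.environ["POSTGRESQL_ADDON_HOST"],
                port=os.environ["POSTGRESQL_ADDON_PORT"],
                dbname=os.environ["POSTGRESQL_ADDON_DB"],
                user=os.environ["POSTGRESQL_ADDON_USER"],
                password=os.environ["POSTGRESQL_ADDON_PASSWORD"],
            )
        except psycopg2.OperationalError:
            time.sleep(delay)  # old leader gone, new one not yet promoted
    raise RuntimeError("could not reach the new leader, restart the application")
```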
We are currently having difficulties with the logs pipeline. This impacts live logs in the Console / CLI as well as drain logs. We are working on it.
EDIT 14:25 UTC: The issue has been mitigated. A fix is scheduled for deployment this afternoon, which should reduce these delivery issues. We will monitor the fix closely once it gets deployed.
We are currently having difficulties with the logs pipeline. This impacts live logs in the Console / CLI as well as drain logs. We are working on it.
EDIT 15:07 UTC: Live logs and drains are back. Some drain logs may have been lost during the recovery process. Sorry for the inconvenience.
EDIT 15:52 UTC: Live logs and drains are down again; we are looking into it.
We identified issues on our metrics/access logs storage. We are working to fix the problem, which is currently causing timeouts on queries.
EDIT 15:52 UTC: Queries have returned to normal; Metrics and Access logs should now be reachable.
Metrics and access logs are currently having an ingestion issue. Some metrics points will be lost, access logs will be kept and ingested at a later time.
Ingestion is now running at full capacity again. There will be some delay before access logs are up to date, but everything should be caught up in a few hours. Sorry for the inconvenience.
The log drains infrastructure went down last night (2022-02-14 around 5 AM) and some drains were lost / are broken.
We are still identifying which ones are broken so we can restart them. If you see that your drains are broken, please contact support so we can restart them!
Edit 15:11 — We restarted all drains to be sure.
Edit 16:27 — Most of the drains are still broken. We are trying to fix the issue by deleting and re-creating message queues in the logs infrastructure.
Edit 16:37 — Deleting and re-creating everything seems to have cleaned up the situation. Drains seem to be working again!
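For the curious, the recovery step above amounts to dropping a wedged queue and declaring it again. A minimal sketch, assuming a RabbitMQ-style broker reachable with pika; the broker address and queue name are made up for illustration and say nothing about our actual topology:

```python
# Drop a wedged queue and re-create it cleanly on the broker.
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()

queue_name = "logs.drains"  # hypothetical drain queue name
channel.queue_delete(queue=queue_name)                 # drop the broken queue
channel.queue_declare(queue=queue_name, durable=True)  # declare it again
connection.close()
```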
Logs and logs drains are experiencing issues. We are investigating.
EDIT 20:32 UTC: Fixed.
We identified issues on our metrics/access logs storage. We are working to fix the problem, which is currently causing timeouts on queries.
EDIT 17:45 UTC: The incident is over, sorry for the inconvenience.
January 2022
We are experiencing issues with logs collection and distribution (drains included).
EDIT 20:27 UTC: We identified the issue, and resolution is ongoing.
EDIT 20:54 UTC: Fixed.
Logs ingestion is down. We are looking into it.
20:39 UTC: The ingestion pipeline is back for now but the underlying issue is not properly fixed yet.
20:49 UTC: Theoretically, the problem is fixed. In any case, the ingestion pipeline is working at full speed. We are keeping an eye on things.
A network incident between our two Paris datacenters occurred at 15:58 UTC and lasted for 55 seconds (with a few seconds where it was back during that time window).
We have dealt with most consequences of that downtime; we are still working on fixing an issue with the ingestion pipeline of Metrics and access logs. There will be some delay.
16:40 UTC: Everything is working as expected; the ingestion delay will go back to normal soon.
Metrics and Access logs queries are currently unavailable. Data is still ingested, only queries are impacted. ETA for resolution is 18:00 UTC.
This impacts:
- Metrics (Grafana, in the console or using our API)
- Access logs (request tiles for an organization / application, via the CLI or our API)
EDIT 17:30 UTC: Everything is back to normal. Sorry for the inconvenience.
We are currently experiencing availability and delay issues on logs. We are working on it.
EDIT 21:58 UTC: Everything should be back to normal, sorry for the inconvenience.
We will migrate our support tool from Intercom to Crisp. This migration will impact your ongoing tickets with our support team: you will still be able to pursue them by replying to the e-mail transcripts, but you can also change the recipient e-mail address to console+intercom@clever-cloud.on.crisp.email.
The migration will start at 19:00 UTC+1 and should apply instantly as soon as you refresh the console.
During the transition, you can directly contact us at supportmail@clever-cloud.com.
EDIT 20:44 UTC+1: The migration has ended, our new support tool is now ready to be used! Make sure to refresh the web console.
December 2021
We need to perform an emergency maintenance operation on one of our core components. This might impact deployments in all zones. Applications already deployed won't be impacted. The maintenance is starting right now.
17:26 UTC: The maintenance operation did not fix the issue. Deployments are completely disabled at the moment. We are investigating.
17:31 UTC: It was DNS (reverse DNS resolution was too slow when opening connections, which then timed out). We are working on bringing everything back up.
17:52 UTC: Everything is back up. If you are experiencing an issue, please contact us.
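The failure mode is worth spelling out: a component that reverse-resolves every peer before handling a connection stalls completely once PTR lookups get slow. A minimal Python sketch of bounding such lookups so they degrade instead of hanging (an illustration of the pattern, not the component involved):

```python
# Bound reverse DNS lookups with a timeout so slow PTR resolution
# cannot block connection handling.
import socket
from concurrent.futures import ThreadPoolExecutor, TimeoutError

_resolver = ThreadPoolExecutor(max_workers=4)

def client_name(peer_ip, timeout=2.0):
    """Resolve a peer's hostname, falling back to the bare IP when
    reverse resolution is slow or missing."""
    future = _resolver.submit(socket.gethostbyaddr, peer_ip)
    try:
        hostname, _aliases, _addrs = future.result(timeout=timeout)
        return hostname
    except (TimeoutError, OSError):
        return peer_ip  # degrade gracefully instead of hanging the accept path
```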
We have experienced two network issues between the two datacenters of the PAR zone:
- Between 09:28 UTC and 09:30 UTC
- Between 09:40 UTC and 09:41 UTC
We do not have any details about this incident as of now.
You may be receiving duplicate Slack notifications.
This is due to an issue with Slack: Slack is replying with 500 errors to our notifications even though they are clearly processing the messages just fine. Our notification system retries after receiving failures, so you will receive multiple duplicates and your webhooks will probably be disabled automatically (as they are after too many repeated failures). We will re-enable them once the issue is fixed. If your webhook remains disabled, please contact us.
14:17 UTC: We have not received a single 500 error from Slack in 8 minutes, so this looks fixed, although a broader incident is still ongoing on Slack's end: https://status.slack.com/2021-12/a17eae991fdc437d
14:44 UTC: Webhooks disabled since 12:00 UTC have been re-enabled. Slack status says the messaging/notifications part of the incident is resolved, and we are not seeing any errors, so this incident is now over. If you are experiencing an error or if your webhook has not been re-enabled, please contact us.
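To illustrate why this produced duplicates, here is a minimal Python sketch of retry-then-disable delivery, the behaviour described above; the threshold and webhook structure are made up, and this is not our actual notification code:

```python
# Retry a webhook delivery on failure and auto-disable the webhook after
# too many consecutive failed attempts.
import requests

MAX_CONSECUTIVE_FAILURES = 5  # hypothetical threshold

def deliver(webhook, payload):
    """POST a notification, retrying on failure; disable after repeated failures."""
    for _ in range(MAX_CONSECUTIVE_FAILURES):
        response = requests.post(webhook["url"], json=payload, timeout=5)
        if response.ok:
            return  # a 2xx ends delivery
        # If the receiver processed the message but still replied 500
        # (as Slack did here), every retry becomes a visible duplicate.
    webhook["enabled"] = False  # too many repeated failures: auto-disable
```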