Incidents
Full history of incidents.
June 2018
Some databases are unreachable.
EDIT 2018-06-18 16:29 UTC: The hypervisor is up again and the databases are coming back up.
Applications that were on this HV were redeployed on another one.
We have detected network instabilities on one of our reverse proxies for the *.cleverapps.io domain, affecting the Paris zone. Our network provider has been notified.
EDIT 15:08 UTC: We are still waiting for our network provider to find the root cause.
EDIT 2018-06-15 13:00 UTC: The instabilities ceased this morning. Everything should be back to normal.
One of the nodes of the shared RabbitMQ cluster went down. We are bringing it back up.
EDIT 10:40 UTC: The node has been restarted, we continue to monitor the situation.
EDIT 13:20 UTC: The cluster has been running fine since the incident
One of our add-on reverse proxies had to be restarted following an increasing rate of refused connections. We will continue to monitor the situation closely.
May 2018
Our git repository will be shut down for up to 15 minutes at 13:30 UTC, May 25th. Deployments will be stopped and git push / clone will be unavailable.
EDIT 13:30 UTC: The maintenance has begun. Deployments are stopped (but queued) and git repositories aren't available anymore.
EDIT 13:39 UTC: The maintenance is over, deployments and git repositories are available again
Some instances are having trouble reaching targets through our VPN service; we are investigating. Timeouts or unreachable routes are expected.
EDIT 09:45 UTC: We may have found why connections are hanging; we are currently running some tests.
EDIT 10:10 UTC: The tests worked fine and a fix has been deployed. All connections should have been restarted. If you still experience trouble connecting to a particular service, please let us know at support@clever-cloud.com with the service you're trying to access.
A maintenance operation is in progress on the storage backend of Metrics. Metrics are currently unavailable.
EDIT 14:40 UTC: Metrics have been back since 14:15. Performance is gradually returning to its usual level.
Deployments are having trouble starting or completing. We are working on it.
EDIT 08:05 UTC: Deployments should be back to normal, we are keeping an eye on the situation.
EDIT 08:33 UTC: Some deployments still won't start
EDIT 09:00 UTC: Deployments should be back to normal again. We are still keeping an eye on the situation and cleaning up the remaining issues
EDIT 12:28 UTC: Again, some deployments are failing to finish even though they appear as successful in the logs. We are looking into it.
EDIT 13:27 UTC: Deployments are going to be stopped to fully clean the system. It should not last more than 15 minutes. The maintenance is starting now.
EDIT 14:08 UTC: Deployments have been available again since 13:45 UTC. The maintenance period is over. We are keeping watch until everything is back to normal.
EDIT 16:30 UTC: Everything seems to be back to normal
We (or one of our clients) were targeted by a DDoS attack starting at 10:05 UTC. We removed the targeted IP from our front pool. The issue has been mitigated. We are still watching it.
At 8:13am Paris time today, our hypervisor hv-par2-036 was detected as unreachable.
A hard reboot has been requested from our hosting provider.
Around 20 add-ons are impacted.
9:17am Paris time: the incident is fixed. All add-ons have recovered.
Network instabilities are affecting one of our reverse proxies, leading to packet / request loss.
EDIT 13:50 UTC: The instabilities stopped 10 minutes ago; we are still closely monitoring the situation.
The SSH Gateway asks for a password for PHP applications instead of letting you connect. We are investigating the issue.
EDIT 08:00 UTC: A new version of the PHP image has been released. Redeploying your application should be enough to let you SSH into the machine again.
We have started a maintenance operation on a component of the Metrics cluster. This operation is taking more time than expected.
Until it's over, Metrics are not available. Metrics agents on scalers should push their data once the service is back.
EDIT 15:14 UTC: Metrics have been back since 15:12 UTC.
April 2018
A dedicated add-ons reverse proxy stopped accepting new connections at 15:28 UTC and was restarted at 15:31:30 UTC.
Traffic was back to normal at 15:32:00 UTC.
Log drains are currently stopped; we are working on fixing this issue.
Due to a network issue, deployments are not working properly. Also, the state of applications might be displayed incorrectly in the console (grey disc instead of a green one).
March 2018
Cellar is having network issues on some nodes. Some requests are failing, both requests to GET resources and requests to send resources.
We are investigating the problem
EDIT 19:35 UTC: The problem seems to be gone. It may be due to a maintenance operation made on the Cellar cluster, which shouldn't have caused this. This maintenance has been done multiple times without problems. We will keep an eye on the cluster when this maintenance starts again, probably tomorrow.
Multiple reports indicate a network slowdown for some clients. We are investigating the issue. Applications may take longer than usual to respond.
EDIT 15:10 UTC: The source of the problem is one of our customers receiving a DDoS attack on their application. While the infrastructure can handle such load, we detected a problem with the configuration of our reverse proxies which prevents us from correctly handling the load of this DDoS. We are looking at how we can improve that. In the meantime, traffic targeting that customer's application has been blocked.
EDIT 16:45 UTC: Most of the traffic is filtered. We will continue to watch the issue over the following hours.
A dedicated add-ons reverse proxy is refusing new connections. It is being restarted.
EDIT 11:49 UTC: Incident over since 11:45 UTC
Real-time log delivery is affected by an outage on our message broker. Log drains are affected as well. Logs are still archived.
EDIT 17:03 UTC: Real-time delivery has been back since 16:50 UTC.