clearwater-cluster-manager doesn't restart Cassandra under Docker #24
Comments
I think the issue here is that |
Actually, the work-around above (of killing Cassandra and restarting clearwater-infrastructure) doesn't seem 100% reliable - it worked on Homestead, but not on Homer. |
It seems sometimes necessary to kill Cassandra multiple times for the work-around to work - I'm not sure why this is. I've also noticed that stopping the deployment doesn't seem to work cleanly. |
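For anyone scripting the work-around, here is a minimal sketch of the "kill, restart, repeat if needed" loop. The two commands are the ones from the issue description; everything else is an assumption. In particular, `is_healthy` is a hypothetical caller-supplied check (nothing in this thread defines how to tell that Cassandra has come up), and the retry count and delay are arbitrary.

```python
import subprocess
import time

# Commands taken from the issue description.
KILL_CASSANDRA = ["pkill", "-f", "cassandra"]
RESTART_INFRA = ["/etc/init.d/clearwater-infrastructure", "restart"]


def apply_workaround(is_healthy, run=subprocess.call, max_attempts=3, delay=0):
    """Repeat kill + restart until is_healthy() reports success.

    is_healthy: hypothetical zero-argument callable returning True once
                Cassandra is considered up (an assumption, not from the thread).
    run:        callable used to execute a command list; injectable for testing.
    Returns the number of attempts used, or None if every attempt failed.
    """
    for attempt in range(1, max_attempts + 1):
        run(KILL_CASSANDRA)   # kill the stuck Cassandra processes
        run(RESTART_INFRA)    # restart clearwater-infrastructure
        time.sleep(delay)     # give the node time to settle (arbitrary)
        if is_healthy():
            return attempt
    return None
```

This would need to be run inside both the Homestead and Homer containers. Note that `pkill -f` matches against full command lines, so the script itself should not be started with "cassandra" anywhere in its own command line.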
Why is live-test-docker working reliably if we have this bug? |
Not sure, but it's not just me that's hit this - it's also been hit on the mailing list (http://lists.projectclearwater.org/pipermail/clearwater_lists.projectclearwater.org/2016-May/002954.html). Agree that we should investigate how live-test-docker differs from our documented install process as part of resolving this. (I notice that it does pass different parameters to docker, although can't immediately see how these would be significant.) |
I've successfully turned up a Docker system and made a call through it. I've used the latest versions of Docker and Compose to rule out the possibility that there's some recent regression. In other words, I can't repro this. It also looks like the mailing list user has now got Docker working (http://lists.projectclearwater.org/pipermail/clearwater_lists.projectclearwater.org/2016-May/002963.html).

One issue I did hit, which might explain the issues, is that if I stop the Docker deployment and start it again, the containers are assigned different IP addresses. We don't automatically spot IP address changes and reconfigure our databases, so changing the IP address means that Cassandra can't start (because cassandra.yaml still has the old IP). (This is not specific to Docker - e.g. see https://github.com/Metaswitch/clearwater-etcd/issues/287)

I'm pretty sure (having checked with @graemerobertson) that we've never had a documented procedure for changing Clearwater's IP addresses, let alone an automated procedure - I'll make sure the product owner's aware of this, but it feels like new function, not bugfix.
|
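To check quickly whether a container has hit the changed-IP case described above, something like the following sketch could help. It assumes the stale address shows up in the `listen_address` setting of cassandra.yaml (a standard Cassandra setting, but the thread only says the file "still has the old IP", so other settings may be involved too). The function names are illustrative only.

```python
import re


def configured_listen_address(yaml_text):
    """Return the listen_address value from cassandra.yaml text, or None.

    Only an uncommented, top-level 'listen_address:' line is considered.
    """
    match = re.search(r"^listen_address:\s*(\S+)", yaml_text, re.MULTILINE)
    if match is None:
        return None
    return match.group(1).strip("'\"")


def ip_is_stale(yaml_text, current_ip):
    """True if cassandra.yaml names an address other than current_ip."""
    configured = configured_listen_address(yaml_text)
    return configured is not None and configured != current_ip
```

In practice `yaml_text` would be the contents of the node's cassandra.yaml and `current_ip` the address the container currently has; how to obtain either depends on the deployment.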
@mirw , is it possible you stopped and then started your Docker containers? |
I don't think so, and I've just reproduced it. I'll send you the details of the box on which I was testing. |
Hmm, interesting. On your machine, ps shows this:
So:
Looking at the cluster manager log, 17:25:17 is too early for Cassandra - cluster-manager only started at that time, and didn't put a cassandra.yaml file into place until 17:25:54. But I'd expect /usr/sbin/cassandra to fail and restart until cassandra.yaml was in place. Attaching with strace, I get:
So it's waiting for a child process - -1 as the first argument means "wait for any child process" - but its only child process is marked as defunct (i.e. zombie state, should be reaped by gdb is unhelpful:
Kernel versions are different on a system hitting this bug (3.13.0-74-generic) and two systems not hitting this bug (3.13.0-57-generic on docker-staging, 3.13.0-83-generic on my dev box). I can't find any relevant-looking bug reports at https://launchpad.net/ubuntu/+source/linux/+bugs though. |
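For reference, the wait call seen in the strace output (first argument -1, meaning "wait for any child") would normally return as soon as a child has exited, which is what reaps the zombie. The sketch below only illustrates that expected behaviour on a healthy system; it is not the Cassandra start-up code, and the exit code 7 is arbitrary.

```python
import os


def reap_any_child():
    """Fork a child that exits at once, then wait for *any* child with pid -1.

    Returns (child_pid, reaped_pid, exit_code).
    """
    pid = os.fork()
    if pid == 0:
        # Child: exit immediately. Until the parent waits, it is a zombie
        # (shown as <defunct> by ps).
        os._exit(7)
    # Parent: -1 means "any child process"; 0 means block until one exits.
    reaped_pid, status = os.waitpid(-1, 0)
    return pid, reaped_pid, os.WEXITSTATUS(status)
```

On the affected machine the equivalent call was observed to stay blocked even though the only child was already defunct.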
Tomorrow I might try and repro this on a 3.13.0-74-generic system then upgrade it to 3.13.0-83-generic and see if that fixes it. |
Setting up from scratch with the Ubuntu Trusty AMI, ami-fce3c696, which uses 3.13.0-74-generic by default, shows the same symptoms:
Now to try with a higher kernel version. |
OK, I have:
I think that's pretty conclusive that this is an issue with the |
#25 adds clear advice not to use that kernel. |
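A deployment script could guard against the affected kernel with a check along these lines. This is only a sketch: the set contains just the one release seen hitting the bug in this thread (3.13.0-74-generic), and other releases may or may not be affected.

```python
import platform

# Only release reported as affected in this thread; 3.13.0-57-generic and
# 3.13.0-83-generic were reported as not hitting the bug.
AFFECTED_RELEASES = {"3.13.0-74-generic"}


def kernel_is_affected(release=None):
    """Return True if the given (or the running) kernel release is known bad."""
    if release is None:
        release = platform.release()  # same string as `uname -r`
    return release in AFFECTED_RELEASES
```

Since containers share the host's kernel, the value reported inside a container reflects the Docker host.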
Symptoms
Spin up a deployment under Docker using an etcd-using version (i.e. commit later than ae892d7).
Cassandra doesn't start, so the deployment is not functional.
Killing Cassandra (using pkill -f cassandra) and then restarting clearwater-infrastructure (using /etc/init.d/clearwater-infrastructure restart) on both Homestead and Homer seems to resolve the problem.
Impact
The deployment is not functional.
Release and environment
Seen on release-98.
Steps to reproduce
Simply start up a deployment using an etcd-using version.