During a deployment, the agent detected a ZooKeeper failure caused by network problems and automatically recovered the connection, but the deployment progress often got stuck. The log output is as follows:
2015/03/06 09:16:18.298 INFO [ClientCnxn] Client session timed out, have not heard from server in 3333ms for sessionid 0x24b74a0ac43f380, closing socket connection and attempting reconnect
2015/03/06 09:16:18.399 WARN [GroovyLangUtils] Detected unexpected exception [ignored]: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /opstools/adp/glu/agents/fabrics/pspt-web-qq/instances/10.68.189.248
2015/03/06 09:16:18.399 WARN [AgentMain] Detected ZooKeeper failure.
2015/03/06 09:16:18.463 INFO [ClientCnxn] Opening socket connection to server 10.208.10.105/10.208.10.105:2181. Will not attempt to authenticate using SASL (unknown error)
2015/03/06 09:16:18.495 INFO [ClientCnxn] Socket connection established to 10.208.10.105/10.208.10.105:2181, initiating session
2015/03/06 09:16:18.528 INFO [ClientCnxn] Session establishment complete on server 10.208.10.105/10.208.10.105:2181, sessionid = 0x24b74a0ac43f380, negotiated timeout = 5000
2015/03/06 09:16:18.627 INFO [AgentMain] Syncing filesystem <=> ZooKeeper
2015/03/06 09:16:19.308 INFO [AgentMain] Connected to ZooKeeper.
There is another problem: during a deployment, if a machine (agent host) crashes or loses its connection, the whole deployment gets stuck and no timeout response is ever received.
There are many reasons why the agent would lose the connection with ZooKeeper. You may want to check if there was an issue with the network. You may also want to check the GC log files both on the ZooKeeper side and on the agent side. There have been reports in the past of losing the connection because the app is "frozen" doing some GC and so the timeout expires. That being said, it is usually in the console, not the agent. That is why I would suggest checking your network infrastructure.
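To get a first idea of how often the agent loses its session, the timeout lines can be pulled out of the agent log. This is only a minimal sketch in Python, assuming the log lines look exactly like the ones posted above; `find_session_timeouts` is a made-up helper name, not part of glu or ZooKeeper:

```python
import re
from datetime import datetime

# Matches the ZooKeeper client timeout lines in the format posted above.
# The format is an assumption taken from that excerpt only.
TIMEOUT_RE = re.compile(
    r"^(?P<ts>\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) INFO \[ClientCnxn\] "
    r"Client session timed out, have not heard from server in (?P<ms>\d+)ms "
    r"for sessionid (?P<sid>0x[0-9a-f]+)"
)


def find_session_timeouts(lines):
    """Return (timestamp, silence_ms, session_id) for every timeout line."""
    hits = []
    for line in lines:
        m = TIMEOUT_RE.match(line)
        if m:
            ts = datetime.strptime(m.group("ts"), "%Y/%m/%d %H:%M:%S.%f")
            hits.append((ts, int(m.group("ms")), m.group("sid")))
    return hits
```

The resulting timestamps can then be compared against long pauses in the GC logs or against known network incidents.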
The other issue, the deployment progress getting stuck, is unclear because you did not provide much information about it. Did you try to abort? Was the console frozen? etc.
I am sure the ZooKeeper disconnect is caused by a network problem.
What I mean is that during a deployment, if the network fluctuates and causes ZooKeeper to disconnect and reconnect, it affects the progress of the entire deployment: the deployment cannot continue, and of course the progress stops too. Another case can also lead to the deployment progress getting stuck: during the deployment, if the glu agent is stopped, or the machine whose agent is being deployed to crashes, the deployment also freezes. I would like a way to ignore the agent on that machine rather than waiting forever, since there is no timeout.
I wonder if I have described all of that clearly enough. I'm not good at English, so I hope that you can understand. Thanks.
I don't think it is a problem of the language as I can understand what you say. I think the issue is that I cannot really understand what is going on from what you describe as there is not enough information.
You are saying: if the glu agent stopped, it freezes the deployment. I understand the "glu agent stopped". I don't understand what you mean by "it freezes the deployment". Is the console not responding anymore? If you could reproduce the problem and take a screenshot of what you are describing, maybe that would help. Can you abort the deployment? I really need to understand what "freeze" means to know whether it is a bug or not. And if you have exact steps to reproduce, that would be great, as I can try to reproduce on my end.
I can guess what the case is here, since we have observed that a few times (rarely).
When orchestrating a deployment, the console triggers events to the agents and waits for each agent to complete each step. The console relies entirely on the agents to implement a timeout on a step (I've seen deployments 'locked' on the same phase for 11 days on a less-watched fabric).
For example, the console deploys on 3 agents sequentially (let's assume there is only one step):
- console tells agent-a to deploy
- agent-a receives the request and starts the deployment
- console checks zookeeper
- agent-a reports success to zk
- console detects response
- console moves on to the next agent
- console tells agent-b to deploy
- agent-b receives the request and starts the deployment
- console checks zookeeper
- for some reason the agent never reports back as done to zk (examples we experienced: the agent crashes, the groovy script calls an interactive prompt which waits for a yes/no answer, the start step is buggy and does not launch the target application in the background, etc.)
- the console keeps on waiting for a status update on zk for agent-b forever until someone manually aborts
So the issue here is that the console is entirely reliant on the agents with regard to timeouts. If the agent has crashed and is no longer reporting, the console should probably mark that step as failed after a specific amount of time (many minutes, to allow someone to manually relaunch the agent if needed). It should also have another, even longer timeout, since getting stuck on 'starting' for 10 days does not make sense even if the agent is still alive. So bottom line, there should be a configurable way to tell the console to abort a step after a predetermined timeout. The 2 cases are: the agent has disappeared, and the step has not completed for hours.
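To make the two cases concrete, here is a rough sketch in Python of the decision the console would have to make. Nothing here exists in glu: `StepTimeouts`, `should_abort` and both default values are hypothetical and only illustrate the idea of a shorter "agent gone" timeout next to a longer "step too slow" timeout.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StepTimeouts:
    # Hypothetical settings, not existing glu options; defaults are illustrative.
    agent_missing_s: float = 15 * 60      # agent has stopped reporting
    step_running_s: float = 12 * 60 * 60  # step is still not completed


def should_abort(now: float, step_started_at: float,
                 agent_last_seen_at: float,
                 timeouts: StepTimeouts) -> Optional[str]:
    """Return a reason to abort the step, or None to keep waiting.

    All times are seconds on the same clock (for example time.monotonic()).
    """
    if now - agent_last_seen_at > timeouts.agent_missing_s:
        return "agent has disappeared"
    if now - step_started_at > timeouts.step_running_s:
        return "step is not completing"
    return None
```

The check for the missing agent comes first because it is the shorter of the two timeouts and gives the more specific reason.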
By default it will retry 10 times (configurable), but eventually a TooManyRetriesAgentException will be thrown.
If this is not working then please submit a ticket for it, as it should be working...
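As an illustration of that behaviour only (this is not the console's actual code), a bounded retry can be sketched in Python as follows. `TooManyRetriesError` is a hypothetical stand-in for the real exception, `call_with_retries` is a made-up helper, and reading "retry 10 times" as one initial attempt plus 10 retries is an assumption.

```python
class TooManyRetriesError(Exception):
    """Hypothetical stand-in for TooManyRetriesAgentException."""


def call_with_retries(operation, max_retries=10):
    """Call operation(); on ConnectionError retry up to max_retries times.

    Assumption: one initial attempt plus max_retries retries.
    """
    last_error = None
    for _attempt in range(1 + max_retries):
        try:
            return operation()
        except ConnectionError as e:
            # Remember the failure and try again until the budget is used up.
            last_error = e
    raise TooManyRetriesError(
        f"gave up after {max_retries} retries") from last_error
```

The point is that the loop always terminates: either the agent answers, or the caller gets an exception it can turn into a failed step.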
b) if an agent is stuck like you describe (waiting for a prompt or whatever), then the deployment will not complete, BUT it can be aborted. Any time limit after which the console fails the deployment is arbitrary and will not fix the issue. If you look at the deployment page you will see that a deployment is still happening. You can then decide whether the amount of time is too much for this kind of deployment, at which point you can abort it, but before aborting you can actually investigate why it is not completing. If the console aborts without intervention, then you would not have a chance to investigate while it is happening. If your issue is that you are not looking at the deployments page and you are afraid of missing it, I suppose we could implement some (configurable) duration that would trigger the console to display an alert on the Dashboard page, for example something like: "deployment xxx taking longer than yyy..." in big red letters...
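A sketch of what such an alert check could look like, in Python. The function name, the tuple layout and the threshold parameter are all hypothetical; the message wording just follows the example above. Nothing is aborted, so the stuck deployment stays available for investigation.

```python
def long_running_alerts(deployments, now, threshold_s):
    """Build alert messages for unfinished deployments older than threshold_s.

    deployments: iterable of (deployment_id, started_at_s, finished) tuples.
    now, started_at_s and threshold_s are seconds on the same clock.
    """
    alerts = []
    for dep_id, started_at, finished in deployments:
        elapsed = now - started_at
        if not finished and elapsed > threshold_s:
            alerts.append(
                f"deployment {dep_id} taking longer than {threshold_s}s "
                f"({elapsed}s so far)")
    return alerts
```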
a) Oh I remember TooManyRetriesAgentException ... I don't think I've seen it in the past 2 years because the agents never really go down and we auto restart them with the help of upstart.
b) I agree that any timeout desired does not belong at the console level but at the groovy script level. That said, we use Glu for entirely unattended deployment of several fabrics (humans rarely look at them) and sometimes I wish the console had an option to set a hard timeout. I do not want the timeout for the production fabrics, but in the case of the ones for our QA and Dev teams, anything that takes more than 12 hours is definitely a 'mistake' and I would rather have these auto-abort.
This is also the case with the 'commands' tab, where I sometimes have confused 'new users' who run 'ping' or 'top' and leave the console, and the processes then run for weeks.
In any case, I do not see this as a big ticket feature.
>>> This is also the case with the 'commands' tab, where I sometimes have confused 'new users' who run 'ping' or 'top' and leave the console, and the processes then run for weeks.
That is an interesting use case. I suppose the console could by default abort the command if nobody has been watching it for some time, unless you mark the command as ok to be running forever (a new selection box in the UI).
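Roughly, the rule could look like this (a Python sketch only; `RunningCommand`, `commands_to_abort` and the idle limit are hypothetical names and settings, not existing console features):

```python
from dataclasses import dataclass


@dataclass
class RunningCommand:
    command: str
    last_watched_at: float     # last time someone had the output open (seconds)
    run_forever: bool = False  # the proposed opt-out box in the UI


def commands_to_abort(commands, now, idle_limit_s):
    """Return the commands nobody has watched for longer than idle_limit_s."""
    return [
        c for c in commands
        if not c.run_forever and now - c.last_watched_at > idle_limit_s
    ]
```

With this rule a forgotten 'ping' or 'top' gets cleaned up, while a command explicitly marked as ok to run forever is left alone.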