NoSuchAgentException between uninstall and install in glu 4.7.1

classic Classic list List threaded Threaded
4 messages Options
Reply | Threaded
Open this post in threaded view
|

NoSuchAgentException between uninstall and install in glu 4.7.1

papercrane
We used to get this much more often, some updates made it much much better, but it's still happening.

In this case, it happened between "Uninstall script" and "install script"

Uninstall script for [/ed/persist] on [i-70ba6442] - 9s
Install script for [/ed/persist] on [i-70ba6442] - 8ms
[+] org.linkedin.glu.orchestration.engine.agents.NoSuchAgentException: i-70ba6442

It looks like this was caused by a disconnection from zookeeper during the uninstall step:

...
May 13 14:27:36 localhost 2013/05/13 14:27:36.584 INFO [/ed/persist] Unconfiguration complete.#012 [glu-agent]
May 13 14:27:36 localhost 2013/05/13 14:27:36.706 INFO [/ed/persist] waitForState([state:installed, timeout:10s, mountPoint:/ed/persist]): true#012 [glu-agent]
May 13 14:27:36 localhost 2013/05/13 14:27:36.873 INFO [/ed/persist] clearError#012 [glu-agent]
May 13 14:27:36 localhost 2013/05/13 14:27:36.975 INFO [/ed/persist] executeAction([action:uninstall, mountPoint:/ed/persist, actionArgs:[encryptionKeys:[*** MASKED ***]]]): ......#012 [glu-agent]
May 13 14:27:37 localhost 2013/05/13 14:27:37.097 INFO [/ed/persist] Uninstall.#012 [glu-agent]
May 13 14:27:37 localhost 2013/05/13 14:27:37.492 INFO [/ed/persist] waitForState([state:NONE, timeout:10s, mountPoint:/ed/persist]): true#012 [glu-agent]
May 13 14:27:37 localhost 2013/05/13 14:27:37.602 INFO [/ed/persist] clearError#012 [glu-agent]
May 13 14:27:46 localhost 2013/05/13 14:27:46.222 INFO [ClientCnxn] Client session timed out, have not heard from server in 8730ms for sessionid 0x339e02884c609c2, closing socket connection and attempting reconnect#012 [glu-agent]
May 13 14:27:46 localhost 2013/05/13 14:27:46.323 WARN [GroovyLangUtils] Detected unexpected exception [ignored]: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /org/glu/agents/fabrics/live/instances/i-70ba6442#012 [glu-agent]
May 13 14:27:46 localhost 2013/05/13 14:27:46.349 WARN [AgentMain] Detected ZooKeeper failure.#012 [glu-agent]
May 13 14:27:46 localhost 2013/05/13 14:27:46.436 INFO [ClientCnxn] Opening socket connection to server zookeeper002/172.30.70.195:2181. Will not attempt to authenticate using SASL (Unable to locate a login configuration)#012 [glu-agent]
May 13 14:27:46 localhost 2013/05/13 14:27:46.437 INFO [ClientCnxn] Socket connection established to zookeeper002/172.30.70.195:2181, initiating session#012 [glu-agent]
May 13 14:27:46 localhost 2013/05/13 14:27:46.440 WARN [ClientCnxnSocket] Connected to an old server; r-o mode will be unavailable#012 [glu-agent]
May 13 14:27:46 localhost 2013/05/13 14:27:46.441 WARN [ZKClient] Expiration detected: trying to restart...#012 [glu-agent]
May 13 14:27:46 localhost 2013/05/13 14:27:46.451 INFO [ClientCnxn] EventThread shut down#012 [glu-agent]
May 13 14:27:46 localhost 2013/05/13 14:27:46.452 INFO [ZKClient] Entering recovery mode#012 [glu-agent]
May 13 14:27:46 localhost 2013/05/13 14:27:46.452 WARN [ZKClient] Recovery mode: trying to reconnect to zookeeper [1]#012 [glu-agent]
May 13 14:27:46 localhost 2013/05/13 14:27:46.452 INFO [ZooKeeper] Initiating client connection, connectString=172.30.67.95:2181,172.30.70.195:2181,172.30.73.231:2181 sessionTimeout=5000 watcher=org.linkedin.zookeeper.client.ZKClient$UniqueWatcher@71c9378c#012 [glu-agent]
May 13 14:27:46 localhost 2013/05/13 14:27:46.487 INFO [ClientCnxn] Unable to reconnect to ZooKeeper service, session 0x339e02884c609c2 has expired, closing socket connection#012 [glu-agent]
May 13 14:27:46 localhost 2013/05/13 14:27:46.521 INFO [/ed/persist] uninstalled#012 [glu-agent]
May 13 14:27:46 localhost 2013/05/13 14:27:46.541 INFO [ZKClient] Exiting recovery mode.#012 [glu-agent]
May 13 14:27:46 localhost 2013/05/13 14:27:46.773 WARN [ZooKeeperStorage] Call ignored org.linkedin.glu.agent.impl.zookeeper.ZooKeeperStorage$_clearState_closure1@6e2035d1: zookeeper is not connected#012 [glu-agent]
May 13 14:27:46 localhost 2013/05/13 14:27:46.793 INFO [ClientCnxn] Opening socket connection to server zookeeper002/172.30.70.195:2181. Will not attempt to authenticate using SASL (Unable to locate a login configuration)#012 [glu-agent]
May 13 14:27:46 localhost 2013/05/13 14:27:46.794 INFO [ClientCnxn] Socket connection established to zookeeper002/172.30.70.195:2181, initiating session#012 [glu-agent]
May 13 14:27:46 localhost 2013/05/13 14:27:46.798 WARN [ClientCnxnSocket] Connected to an old server; r-o mode will be unavailable#012 [glu-agent]
May 13 14:27:46 localhost 2013/05/13 14:27:46.798 INFO [ClientCnxn] Session establishment complete on server zookeeper002/172.30.70.195:2181, sessionid = 0x239e02a873b09ad, negotiated timeout = 5000#012 [glu-agent]
May 13 14:27:46 localhost 2013/05/13 14:27:46.909 INFO [AgentMain] Syncing filesystem <=> ZooKeeper#012 [glu-agent]
...
Reply | Threaded
Open this post in threaded view
|

Re: NoSuchAgentException between uninstall and install in glu 4.7.1

papercrane
I also found these exceptions in the org.linkedin.glu.agent-server.out log but it has no timestamps so I'm not sure that it's related:

Exception in thread "Thread-1070621" groovy.lang.GroovyRuntimeException: exception while reading process stream
        at org.codehaus.groovy.runtime.ProcessGroovyMethods$TextDumper.run(ProcessGroovyMethods.java:482)
        at java.lang.Thread.run(Thread.java:662)
Caused by: java.io.IOException: Bad file descriptor
        at java.io.FileInputStream.readBytes(Native Method)
        at java.io.FileInputStream.read(FileInputStream.java:220)
        at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:264)
        at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:306)
        at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:158)
        at java.io.InputStreamReader.read(InputStreamReader.java:167)
        at java.io.BufferedReader.fill(BufferedReader.java:136)
        at java.io.BufferedReader.readLine(BufferedReader.java:299)
        at java.io.BufferedReader.readLine(BufferedReader.java:362)
        at org.codehaus.groovy.runtime.ProcessGroovyMethods$TextDumper.run(ProcessGroovyMethods.java:475)
        ... 1 more

(the other log goes through syslog via a config change)
Reply | Threaded
Open this post in threaded view
|

Re: NoSuchAgentException between uninstall and install in glu 4.7.1

sodul
Hi Justin,

I have not seen this issue yet on our in-house VMs. My guess is that there is a race condition bug in Glu that is mostly invisible as long as your instances are 'snappy' enough, but if you throw in very low CPU instances such as EC2 m1.small then you hit it more often. Have you tried to replace your zookeeper instances with something that is more responsive such as c1.medium? I assume that both your console and agents are upgraded to 4.7.1. (BTW: there is a bug in 4.7.1, so you should upgrade to 4.7.2 or 5.0.0)

Yan: I have yet to see that issue outside of an Amazon EC2 environment, but we are still operating at a much smaller scale than Justing (both in frequency and number of machines).
Reply | Threaded
Open this post in threaded view
|

Re: NoSuchAgentException between uninstall and install in glu 4.7.1

frenchyan
Administrator
In reply to this post by papercrane
In the past when I have seen this kind of error it was related to the process becoming unresponsive. Things that come to mind in this case could be:

1) the gc: you should check the gc log file both on the agent and on the Zookeeper instances and see if you can detect any strong gc activity
2) the memory itself. If for example, the agent which I am sure is used to install some applications (otherwise why would you use glu :), is competing with the memory available on the machine: the application may be using a lot of memory which does not leave much for the agent then if the process ends up being swapped out then it could trigger the same behavior since it does not run while being swapped out. If you use small instances like Stephane is saying, that could be an issue.

It seems pretty clear from the log that there is an 9s gap (it jumps from 37 to 46), which could be explained by either 1 or 2 and that would definitely trigger the kind of exception you see with ZooKeeper.

Hope this helps.
Yan