Glu can't sync with agent's state after an upgrade
I work with Eran Harel at Outbrain.
We are now in the process of upgrading our glu agents from version 4.4.0 to 4.7.2.
We use Chef to maintain our infrastructure packages, so the upgrade is not done with Glu's built-in agent upgrade mechanism, but in the following way:
- Stop the agent
- Move the old agent's (4.4.0) installation directory to an archive directory
- Get 4.7.2 tgz
- Untar it
- Configure it
- Start it
The agent connects to the console, but on the agent's page all the applications installed on the node are missing; they are not even marked as needing deployment. (The applications themselves are still running as before; stopping the agent did not stop them.) The main deploy button is red. The agent initially seems to have issues connecting to ZooKeeper:
2013/06/13 10:23:04.808 INFO [ZooKeeper] Initiating client connection, connectString=XXXXX sessionTimeout=5000 watcher=org.linkedin.zookeeper.client.ZKClient$UniqueWatcher@a413dbb
2013/06/13 10:23:04.854 INFO [ClientCnxn] Opening socket connection to server XXXX/XXXX:2181. Will not attempt to authenticate using SASL (unknown error)
2013/06/13 10:23:04.862 INFO [ClientCnxn] Socket connection established to XXXX/XXXX:2181, initiating session
2013/06/13 10:23:04.943 WARN [ClientCnxnSocket] Connected to an old server; r-o mode will be unavailable
2013/06/13 10:23:04.944 INFO [ClientCnxn] Session establishment complete on server XXXX/XXXX:2181, sessionid = 0x23f2e8c35c003bf, negotiated timeout = 5000
But later it seems to be syncing:
2013/06/13 10:23:06.678 INFO [AgentMain] Syncing filesystem <=> ZooKeeper
In the dashboard, the agent appears as 'NOT deployed'.
The only thing that fixes this is deploying the host (pushing the red button...).
The plan is install->configure->start.
Our Glu console version is 4.4.0 (should also be upgraded in the near future).
The file version.txt contains the version of the agent to be started or stopped, which is why it gets updated in step 4. The folder <root>/data contains the state of the agent and is OUTSIDE the actual agent install (which lives under 4.4.0 or 4.7.2).
If you do not follow this procedure and instead create the whole directory structure starting at <root>, then you MUST copy the data folder from the old agent into the new one before starting the new one. That is why you no longer have any state. The built-in procedure is more complicated because it upgrades in place (in the same folder), but if you create a new folder (which is OK), simply copying the data folder will solve your issue.
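For a Chef-style upgrade that builds a fresh directory tree, the copy could be scripted as one extra step between "Configure it" and "Start it". The following is only a minimal sketch in Python, not part of glu: the function name carry_over_agent_state and the two root paths it takes are hypothetical, and it assumes the state lives in a folder named data directly under each agent root, as described above.

```python
import shutil
from pathlib import Path


def carry_over_agent_state(old_root, new_root):
    """Copy <old_root>/data into <new_root>/data before the new agent starts.

    Hypothetical helper: old_root is the archived agent root, new_root is
    the freshly untarred and configured one.
    """
    src = Path(old_root) / "data"
    dst = Path(new_root) / "data"

    if not src.is_dir():
        raise FileNotFoundError(f"no agent state found at {src}")

    # Refuse to merge into a data folder that already holds state,
    # for example because the new agent was already started once.
    if dst.is_dir() and any(dst.iterdir()):
        raise FileExistsError(f"{dst} already holds state; refusing to overwrite")

    # dirs_exist_ok (Python 3.8+) allows filling an empty data folder
    # that the fresh install may have created.
    shutil.copytree(src, dst, symlinks=True, dirs_exist_ok=True)
    return dst
```

It would be called with the archive directory and the new install root while the agent is still stopped. The refusal to overwrite is a deliberate choice in this sketch: if the new agent has already been started without its state, it seems safer to stop and inspect than to merge two data folders blindly.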