We have scripts that generate the Glu model automatically from the list of live agents, and once in a while an agent crashes in one of our larger fabrics. The next generation of the model then 'drops' that agent, and when the agent is restarted the model shows 'Should not be deployed'.
To avoid getting into this situation when an agent is only temporarily incapacitated, we would like to get the list of agents that have not responded recently. Is there a way to get the agents that are not marked as active in ZooKeeper? I know the data on them is still present and we could query ZooKeeper directly, but that would be a hack, and it would be cleaner if the console could expose this through a new API. The API could have a 'timeout' option to only show hosts that have been active in the last 24 or 48 hours. This could also double for monitoring purposes, so we could detect agents that are no longer responding and alert on them.
Right now when an agent is down (but present in the static model) there are 2 places where you can see in the console that it is missing (see screenshot attached). If the agent is not present in the model it won't appear.
What happens if you invoke the GET /agents REST API? It should return something similar to the /console/agents page, so you could invoke this API first, before generating the model.
Yes, we call that API first, but we rely on some of the agents' metadata to build the model, and that metadata is no longer available from /agents once an agent has timed out in ZooKeeper.
We have a workaround, which is to write the agents' metadata into the metadata section of the model as a form of persistent storage. We (briefly) considered talking to ZooKeeper directly, but this is not an option because it would open access to ZooKeeper from non-prod machines and could be a security issue for our production environments. The current solution of 'abusing' the metadata section of the model works for now, but it could eventually become a scaling issue.
So here is the issue: the agent registers an ephemeral node in ZooKeeper (the node that contains the info about the agent and that is accessible in the console). It is an ephemeral node because such a node is guaranteed to disappear from ZooKeeper whenever the agent is gone, whether intentionally (shutdown) or not (software/hardware crash).
Once the agent is gone, the data is no longer in ZooKeeper, so it cannot be retrieved. I am afraid that what you are asking for is just not possible: there is no API that could be exposed with the current code that would give you what you want. Even if you interrogate ZooKeeper yourself, you won't find the data.
In order to implement what you are asking for, the code would have to be changed. One thing I could imagine is that the agent also writes a non-ephemeral node. But this would have to be thought out more thoroughly (it has an impact on size, APIs, data structure, etc.).
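To make the ephemeral vs. non-ephemeral distinction concrete, here is a toy in-memory model of the behavior described above, written in Python. It is only a sketch: it is not the ZooKeeper API and not glu code, and ToyStore, its methods, the node paths and the node data are all hypothetical names chosen for illustration.

```python
class ToyStore:
    """Toy stand-in for a ZooKeeper-like store (hypothetical, not the real API)."""

    def __init__(self):
        # path -> (data, owning session id, or None for a persistent node)
        self._nodes = {}

    def create(self, path, data, session=None):
        """Create a node; passing a session id makes it ephemeral."""
        self._nodes[path] = (data, session)

    def get(self, path):
        """Return the node data, or None if the node does not exist."""
        entry = self._nodes.get(path)
        return entry[0] if entry else None

    def close_session(self, session):
        """Simulate a shutdown or crash: drop every ephemeral node of that session."""
        self._nodes = {
            path: (data, owner)
            for path, (data, owner) in self._nodes.items()
            if owner != session
        }


store = ToyStore()
# Current behavior as described above: the agent info lives in an ephemeral node.
store.create("/toy/ephemeral/agent-1", {"host": "h1"}, session="session-1")
# The idea floated above: the agent also writes a non-ephemeral node.
store.create("/toy/persistent/agent-1", {"host": "h1"})

store.close_session("session-1")  # the agent goes away

ephemeral_after = store.get("/toy/ephemeral/agent-1")    # node is gone
persistent_after = store.get("/toy/persistent/agent-1")  # node is still there
```

In this toy model the persistent copy survives the agent, which is what the request needs, but it also never goes away on its own, so something would have to clean it up. That is part of the size and data structure impact mentioned above.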
OK, somehow I assumed that the agent data was not ephemeral; I was wrong. We have a workaround: we write the agents' data (filtered down to the bare minimum we need) back under the metadata section. It works great so far. There is a little overhead, but it is very acceptable at our current scale, and we can easily switch to another alternative in the future.
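For anyone who wants to do something similar, the workaround can be sketched roughly as below. This is a simplified illustration under stated assumptions, not our actual script: the field names in KEPT_FIELDS, the shape of the saved entries, the "lastSeen" key and the 48 hour default are all hypothetical. The function takes the agents that are live right now plus the copy saved in the previous model's metadata section, and returns the copy to store in the next model.

```python
import time

# Hypothetical: the only per-agent fields the model generator needs.
KEPT_FIELDS = ("hostname", "tags")


def merge_agent_metadata(live_agents, saved, now=None, max_age_hours=48):
    """Merge live agent data into the copy kept in the model metadata.

    live_agents: {agent_name: {field: value}} for agents that are up right now.
    saved: {agent_name: {"lastSeen": epoch_seconds, "data": {...}}} from the
           previous model.
    Returns a new dict with the same shape as `saved`.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_hours * 3600

    # Keep remembered agents that are still inside the grace window.
    merged = {
        name: entry for name, entry in saved.items() if entry["lastSeen"] >= cutoff
    }
    # Refresh (or add) every live agent, keeping only the fields we need.
    for name, data in live_agents.items():
        merged[name] = {
            "lastSeen": now,
            "data": {key: data[key] for key in KEPT_FIELDS if key in data},
        }
    return merged


def missing_agents(live_agents, merged):
    """Agents that are remembered but not currently live (candidates for alerts)."""
    return sorted(set(merged) - set(live_agents))
```

An agent that crashed stays in the generated model until it has been gone for longer than max_age_hours, which avoids the 'Should not be deployed' state after a restart, and missing_agents gives a simple list to alert on.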