Feature Request: Add validate phase to the Glu Script

rantav
The way I see things, this is a desirable workflow (at least to me...).
When deploying a fresh version of a service to a cluster of servers:
1. Deploy to one server
2. Validate the first server
3. Deploy to 2 servers
4. Validate those two
5. Deploy to 4 servers, validate...
6. ... deploy more, such that at any given time no more than Y servers are down; call it the sliding window. The sliding window grows exponentially, starting at 1 and ending at Y, which is controlled by the user.
7. Report success if all servers were validated.
8. If any of the servers reported a failure, roll back all of them and report failure. Rolling back would again have to make sure that at least 50% of the hosts are up at any given time.
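
The batch schedule in the steps above can be sketched roughly as follows. This is only an illustration of the 1, 2, 4, ... up to Y progression; the function names (batch_sizes, plan_batches) and the max_window parameter (the Y above) are made up for the example and are not part of glu.

```python
def batch_sizes(num_hosts, max_window):
    """Return the size of each deployment batch: 1, 2, 4, ... capped at max_window.

    max_window plays the role of Y (hypothetical parameter name).
    """
    if num_hosts < 0 or max_window < 1:
        raise ValueError("num_hosts must be >= 0 and max_window >= 1")
    sizes = []
    remaining = num_hosts
    window = 1
    while remaining > 0:
        # Never exceed the cap, and never plan more hosts than are left.
        size = min(window, max_window, remaining)
        sizes.append(size)
        remaining -= size
        window *= 2  # exponential growth of the sliding window
    return sizes


def plan_batches(hosts, max_window):
    """Split the host list into consecutive batches following batch_sizes()."""
    batches = []
    start = 0
    for size in batch_sizes(len(hosts), max_window):
        batches.append(hosts[start:start + size])
        start += size
    return batches
```

Each batch would be deployed and validated before the next one starts; the validation and rollback steps are left out of this sketch.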

Using the console's API (or maybe even the agent's API) this is probably doable but I'm suggesting this workflow as a core feature so that glu users would not have to re-implement.

I think a simple step forward, if you agree with the approach, is to add a validate() method to the glu script. So far we have install, configure, start, stop, unconfigure and uninstall, so I suggest adding validate: if it fails (returns false or throws), glu would initiate a rollback protocol.

thoughts?

(originally posted here: https://github.com/linkedin/glu/issues#issue/7)

Re: Feature Request: Add validate phase to the Glu Script

frenchyan
Administrator
This is a biggy :)

First let's start with 'validate()'. There are several things to talk about here:

1) Every phase in the glu script is supposed to block until the phase successfully completes. Usually the most important one is the 'start' phase: it needs to start the server and wait/block until the server is actually up and running. I do not believe the documentation is clear about this (it needs to be improved). I am also planning to add some real glu scripts (the most obvious one being a webapp within jetty) to demonstrate how to write a glu script. So I am not too sure what the 'validate()' method is supposed to do, but it could be folded into the end of the start phase: in other words, start does not complete until it validates, and if it does not validate, then simply fail the phase (shell.fail('validation failed')). The only time I think that could be a problem is if you do not want to run validate() every time you run start (for example, if you bounce your application, should you also run validate()?).
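
To make the "blocking start that validates" idea concrete, here is a rough language-neutral sketch in Python (a real glu script would be Groovy and would fail the phase with shell.fail as mentioned above). Everything named here (start_and_validate, launch, is_healthy, PhaseFailed, the timeout values) is an assumption for illustration only, not glu's API.

```python
import time


class PhaseFailed(Exception):
    """Stands in for failing the phase (hypothetical, not a glu class)."""


def start_and_validate(launch, is_healthy, timeout_s=30.0, poll_s=0.5,
                       clock=time.monotonic, sleep=time.sleep):
    """Launch the server, then block until it validates or the timeout expires.

    launch:     callable that starts the server process (assumed to return quickly)
    is_healthy: callable returning True once the server is actually up
    """
    launch()
    deadline = clock() + timeout_s
    while True:
        if is_healthy():
            return  # start completes only after validation passes
        if clock() >= deadline:
            # Validation never passed: fail the whole phase.
            raise PhaseFailed("validation failed")
        sleep(poll_s)
```

With this shape, a caller cannot observe a "started" state for a server that never became healthy, which is the point of folding validation into start.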

2) The agent already has the capability to invoke any method/closure on the script. If you check the api (https://github.com/linkedin/glu/blob/console-cli/agent/org.linkedin.glu.agent-api/src/main/groovy/org/linkedin/glu/agent/api/Agent.groovy) @ line 102 you will see that the method exists (executeCall). You can see some examples in the unit test: https://github.com/linkedin/glu/blob/console-cli/agent/org.linkedin.glu.agent-impl/src/test/groovy/test/agent/impl/TestAgentImpl.groovy @ lines 369 and 677. Currently neither the agent cli nor the console uses this functionality (although adding it to the agent cli is rather simple).

Now let's talk about rolling back:

Rolling back is an interesting concept. Currently glu has been designed so that the system model (the one you load in the console) represents what the system should look like. When you want your system to change (for example you want to upgrade all your applications X to a new version), you change the model and you load it in the console. It generates a delta which is the base of the plan computation. I could assume that rolling back means, reload the previous system and 'fix' the delta. One very good mental model is the way git handles branching (snapshots): when you issue git checkout <sha-1> what git does is make sure that your local filesystem matches the snapshot with <sha-1>. glu acts exactly the same.
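
As a rough illustration of the delta idea, assuming a heavily simplified model where each entry is just (agent, mount point) mapped to a version (the real system model carries more than that, and compute_delta is a made-up name):

```python
def compute_delta(current, desired):
    """Return {key: (current_version, desired_version)} for entries that differ.

    A version of None means the entry is absent on that side.
    """
    delta = {}
    for key in sorted(set(current) | set(desired)):
        have = current.get(key)
        want = desired.get(key)
        if have != want:
            delta[key] = (have, want)
    return delta


model_1 = {("host1", "/x"): "1.0", ("host1", "/y"): "1.0"}
model_2 = {("host1", "/x"): "2.0", ("host1", "/y"): "1.0", ("host2", "/x"): "2.0"}

upgrade = compute_delta(model_1, model_2)   # what to fix to reach model 2
rollback = compute_delta(model_2, model_1)  # same computation, models swapped
```

In this simplified view a rollback is nothing special: it is the same delta computation with the previous model as the target, much like checking out an older snapshot.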

Where it starts to get tricky is if you have the following scenario:

assuming system model 1 is 'up and running'
load system model 2 (contains a change for applications X to a new version and applications Y to a new version)
'fix' the delta for applications X... everything works
'fix' the delta for applications Y... something fails... rollback...

What does it mean to issue a rollback? Should I revert to system model 1 (in which case applications X will also be reverted to the previous version)? Or, most likely, what you want is a new system model (1.5?) which contains the changes from X but not from Y (which, if you think about it, is like rewriting history in git)? Now the issue is that the model in the console is neither 1 nor 2, which were the models loaded in the console (and thus coming from some source of truth). So you are diverging.

In my mind it would make more sense to treat every change as a fully rollbackable set of changes. In other words, if you are not willing to roll back the changes for applications X when applications Y fail, then split them into two different sets of changes.

I hope all this makes sense. In the end these are not trivial concepts: what does validate() do? Is it different from a blocking 'start' phase? How to implement rollback? Then of course how does that tie back into the console and the ui, and how to make it configurable enough?

Clearly a lot of work, but I think it will improve glu if done right!

Yan

Re: Feature Request: Add validate phase to the Glu Script

rantav
I didn't know that start() can fail the installation. That'll be good enough
for my purposes; I'll use it to validate the state of my app after it has been
started.

WRT rollbacks, if both X and Y are in the same deployment and X is
successful but then Y is failing then IMO both X and Y should be rolled
back. IMO a user updating the desired state should make the smallest diffs
possible, as long as they are atomic and take into account that if one
component fails then they are all considered failed. I think glu should
encourage this behavior of many small updates.
I agree that rollbacks can be complex, but in my view they are necessary to
be able to have a fire-and-forget type of system, e.g. if something fails
you want to trust glu that it'd sail back to safe harbor and wait for
further commands. This brings up another point, what should happen after
such failure and a rollback? I think that in this case it's plausible to
expect glu to reject all future deployments until a person goes to the
dashboard (or an API) and clicks the "I've seen what's wrong and I've fixed
it, you may continue now" button.
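
The "reject deployments until someone acknowledges" behaviour could look roughly like this. It is a sketch of the idea only; DeploymentGate, DeploymentsBlocked and the method names are invented for the example and do not exist in glu.

```python
class DeploymentsBlocked(Exception):
    """Raised when a deployment is attempted while the gate is locked."""


class DeploymentGate:
    """Refuses new deployments after a rollback until a person acknowledges it."""

    def __init__(self):
        self.locked = False

    def record_rollback(self):
        # Called after a failed deployment has been rolled back.
        self.locked = True

    def acknowledge(self):
        # The "I've seen what's wrong and I've fixed it" button.
        self.locked = False

    def check(self):
        # Call before starting any deployment.
        if self.locked:
            raise DeploymentsBlocked("previous rollback has not been acknowledged")
```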

I've raised another point: gradual-exponential deployments (just made up the
name), which means deploy one, then two concurrently, then 4, up to Y concurrent
deployments (Y is a user parameter). Then if any failed, roll back (in
reverse order).

Re: Feature Request: Add validate phase to the Glu Script

frenchyan
Administrator
From how you describe rollback, I believe we are on exactly the same page: you change the system, and if anything fails you roll back everything. This should encourage smaller atomic sets of changes (atomic meaning either it works or everything is rolled back).

This way is a lot easier to implement because it is just a matter of recording which one is the current model, then loading a new one, applying changes and if anything fails then reloading/reverting to the previous one and reapplying the changes.
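
In pseudo-real code, the sequence described above is roughly the following. This is a sketch under the assumption that "applying a model" is a single call that either succeeds or raises; deploy_atomically and apply_model are hypothetical names, not console APIs.

```python
def deploy_atomically(current_model, new_model, apply_model):
    """Apply new_model; if that fails, re-apply the previously current model.

    Returns the model that is in effect afterwards. If the revert itself
    raises, the exception propagates to the caller.
    """
    previous = current_model  # record which model is current before changing anything
    try:
        apply_model(new_model)  # load the new model and fix the delta
        return new_model
    except Exception:
        apply_model(previous)   # revert: reload the previous model and re-apply it
        return previous
```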

I noticed the "gradual-exponential deployments". I am still debating at which level it makes more sense implementing it (at least in the first round).

By the way, I will release the cli for the console sometime tomorrow (Friday 24th)

Yan

Re: Feature Request: Add validate phase to the Glu Script

rantav
We are on the same page :)

--
/Ran

Re: Feature Request: Add validate phase to the Glu Script

Shivram
In reply to this post by frenchyan
The idea of quick rollbacks when issues arise is great. However, if GLU is used to roll back a part of the system that does not pass validation, the rollback may hold only until the next reload of the bom/container artifacts: the release bom would still have the latest version of, say, X (which was rolled back in GLU because it did not work), so the delta would show up again as soon as the bom is loaded (possibly to deploy something other than X).
--> So the rollback should be done at the release bom level as well, so that the version of X is the same as the one rolled back in the GLU system. There was a discussion about this.
--> Also, if something in the system being executed in GLU fails, roll back the whole system to a previous state. This is possible/practical if we have a product-based system.
For example, units or multi-product, which roll forward or roll back as a unit (all parts in the unit are rolled back even if one part, X, has an issue). However, currently this is also controlled by release bom versioning: the bom has to be reloaded at a previous version to initiate a rollback. But the great thing is that this can easily be integrated into the GLU system to roll back the product (units in the system) when a part fails validation.

Does this make sense? Let me know your thoughts; I can elaborate if something is not clear.

Re: Feature Request: Add validate phase to the Glu Script

rantav
Hi Shivram, as a matter of fact, I'm not following...
What is a bom?
Are you supporting rollbacks or are you suggesting something else? Sorry, I'm not sure I understand your questions or suggestions...

I haven't deployed glu in real large-scale production systems, so I'm only suggesting things that I think would make sense in my future usage. That's still rather theoretical, so if you have live deployment experience your mileage may vary.


Re: Feature Request: Add validate phase to the Glu Script

frenchyan
Administrator
I think Shivram is talking about some LinkedIn specific concepts that are not part of the open source: at LinkedIn we have 3 concepts:

* a containers.xml file which describes what containers are made of
* a bom file (bom = bill of materials) which essentially contains versions of artifacts
* topology files which describe which containers go on which host

Those 3 concepts are processed to generate the fully expanded system model.

So in the open source those concepts are not valid since we deal with the system model directly. This is why it is a little bit confusing I think :)

But in the end, rollback always means going back to a previous version of the model, whether it was generated by processing the 3 concepts or simply loaded directly. What is important is that the source of truth is checked in.

Yan

Re: Feature Request: Add validate phase to the Glu Script

rantav
OK I guess...
I'm not sure I mentioned this, but in the case of rollback I think you should try only once. If the rollback itself fails (this is quite possible) then don't try anything smart: just stay where you are and don't try to roll back even more (e.g. the app server may stay down if the rollback failed). This is of course a very alarming state, so it's nice to have all kinds of red UI and emails shot...
So you have two levels of error severity: one is that the current update failed but the rollback was successful (there's hope that the system is stable, although not at its latest desired version), and the other is that even the rollback failed, so the system is guaranteed to be in deep shit (sorry ;).
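
Something like this is what I have in mind for the single rollback attempt and the two severity levels. It is a sketch only; the Outcome names and update_with_single_rollback are made up for the example.

```python
from enum import Enum


class Outcome(Enum):
    SUCCESS = "update succeeded"
    ROLLED_BACK = "update failed, rollback succeeded"
    ROLLBACK_FAILED = "update failed, rollback failed"


def update_with_single_rollback(update, rollback):
    """Run update(); on failure attempt rollback() exactly once.

    Never retries the rollback and never rolls back any further:
    if the rollback fails, the system is left where it is.
    """
    try:
        update()
        return Outcome.SUCCESS
    except Exception:
        pass  # fall through to the one and only rollback attempt
    try:
        rollback()
        return Outcome.ROLLED_BACK       # lower severity: stable, but not at the desired version
    except Exception:
        return Outcome.ROLLBACK_FAILED   # highest severity: alert loudly, do nothing else
```

The caller would map ROLLED_BACK to a warning and ROLLBACK_FAILED to the red UI and emails.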


Re: Feature Request: Add validate phase to the Glu Script

Shivram
In reply to this post by frenchyan
I have to read the notes/docs here, as right now I am thinking from the LinkedIn GLU implementation's point of view.
Anyway, I have to read through and find out where the source for the system in GLU open source comes from, if not from the build artifacts.
I will reply with any questions/suggestions.