Re: Potential concurrent issue?

frenchyan
Administrator
I have never seen this exception before. I can see that a lock is being taken

        at org.linkedin.glu.console.controllers.ControllerBase.withLock(ControllerBase.groovy:119) 

to prevent two people from changing the system at the same time, so I am not sure what is happening. Maybe there is another place in the code that does not grab the lock. I need to investigate.

Yan
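
To make the locking pattern mentioned above concrete, here is a minimal Python sketch. It is not glu's code (the console is written in Groovy): the names `with_lock`, `set_current_system` and the `_state` dictionary are hypothetical, chosen only to mirror the `withLock` call in the stack trace. It shows why every writer has to go through the same lock.

```python
import threading
from contextlib import contextmanager

# Hypothetical stand-in for the console's single "current system" record.
_state = {"current_system": None, "history": []}
_lock = threading.Lock()


@contextmanager
def with_lock():
    """Serialize every action that mutates the shared state."""
    with _lock:
        yield


def set_current_system(system_id):
    # Both writes happen under the lock, so no other writer can slip in
    # between them and leave the record inconsistent.
    with with_lock():
        _state["history"].append(system_id)
        _state["current_system"] = system_id


def run_concurrent_updates(system_ids):
    """Apply all updates from separate threads and return the final state."""
    threads = [
        threading.Thread(target=set_current_system, args=(system_id,))
        for system_id in system_ids
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return _state["current_system"], list(_state["history"])
```

If any code path updated `_state` without calling `with_lock()`, two writes could interleave, which is the kind of gap suspected above.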

xyang
Thanks Yan, please let me know if you need any additional information.

xyang
In reply to this post by frenchyan
Hi Yan,

This happens quite often. Would you be able to provide an update on the investigation? Thanks!

frenchyan
Administrator
I have not had the chance to look at this problem, and I am not sure when I will (I am busy with other projects at the moment). The message is caused by the DbCurrentSystem table, which gets updated only when either

a) somebody selects the "set current" option on the model list page (or makes the equivalent REST call)
b) somebody loads a new system (via the UI or a REST call)

so I would strongly suggest that you investigate on your side who (and/or which automated process) is doing this. Although I agree that glu should obviously not fail when it happens, in practice changing the current model is quite a "critical" action and should probably be carefully orchestrated on your side: who is doing it, and when. I am also not clear how having multiple people do it at the same time can avoid producing a result you don't want. In the best case, glu will no longer fail (once the issue is addressed) and will happily schedule the changes one after the other, but on your end that will still not fix the issue of who did what and when, or the unpredictability of the end result.

Yan
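
One way to orchestrate model changes on the client side, as suggested above, is to make the automated callers take turns. The sketch below is an assumption about how that could look, not part of glu: `run_serialized` and the lock file path are hypothetical, and the command to run is passed in by the caller rather than assumed. It relies on POSIX advisory locks (`fcntl.flock`), so it only serializes callers running on the same host.

```python
import fcntl
import subprocess
from contextlib import contextmanager


@contextmanager
def exclusive_lock(lock_path):
    """Hold an exclusive advisory lock on lock_path (POSIX only)."""
    with open(lock_path, "w") as handle:
        fcntl.flock(handle, fcntl.LOCK_EX)  # blocks while another caller holds it
        try:
            yield
        finally:
            fcntl.flock(handle, fcntl.LOCK_UN)


def run_serialized(command, lock_path="/tmp/glu-model-update.lock"):
    """Run one model-changing command at a time and return its exit code.

    command is the full argument list of whatever tool performs the update;
    the default lock_path is a hypothetical location.
    """
    with exclusive_lock(lock_path):
        return subprocess.run(command, check=False).returncode
```

This does not record who changed what; it only guarantees that two updates started from the same host never overlap.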



xyang
Thanks for your response.

The Glu console that keeps hitting this exception runs in an environment where everyone is free to update the model. I believe everyone uses REST calls to do so. Ideally Glu, which is a deployment and monitoring automation platform, should be able to order multiple concurrent model update actions.

I should also have mentioned in the original thread that after this exception got thrown, Glu console will stay in a freezing/locking state where no one can do anything and block deployments. This breaks the automated delivery pipeline which Glu plays a very important part in.

frenchyan
Administrator
Which version of glu/java?

Is it new since a recent update? It could have been introduced by the migration to the latest grails, which is why I need to know the versions.

"Glu... should be able to order multiple concurrent model update actions" => like I said in my previous email I do agree with this statement. Everyone is free to update the model => I am not entirely sure how you get predictable results in the end, which glu will not fix.

Yan


xyang
We were using Glu 4.7.2, then upgraded to 5.5.2, and are now using 5.5.5. We started to see the issue more in 5.5.5; we are not sure whether it happened in older versions. Java is jdk1.7.0_07.

We use version control to store the Glu model. After a push to version control, the model is immediately pushed to Glu, so in theory the latest push to version control should be the end result of the Glu model. I understand Glu will not fix this issue; having the model updated in the order the submissions arrive via the REST call should be fine. I appreciate your help very much!


frenchyan
Administrator
Could you please describe the actions you usually perform on the model?

Are you only saving a new version? Via rest only? Do you know if you use setAsCurrent (changing which model is current? via rest? via the UI?)

What do you mean by "pushed"? How is "pushed" implemented?

From a quick look at the code, the only use case I can think of where this would happen is when something is saving a new model (in REST: POST on /model/static, or loading or updating in the UI) while somebody else is changing the current model (set as current; in REST: POST on /model/static with id, or selecting a new current model in the UI).

If you could provide steps to reproduce that would be helpful. Or at the very least detail how your automated processes/users are acting on the model.

Yan

xyang
All the REST actions are performed using the console-cli.py that comes with the Glu package. The action invoked most often is load; the other actions, like deploy/redeploy, are also invoked, but less frequently than load.

I don't know what you mean by 'are you only saving a new version'. I am pretty confident that almost all model changes are performed by REST calls via console-cli.py. Setting the current model via the UI is very rarely done.

Push is just a Git push from the local repository to the remote.

POST on /model/static sounds like the culprit. That matches the exception we found in the log.

As I said in the previous post, we use version control for storing the model. There is also a wrapper tool around console-cli.py in version control that helps invoke the load/deploy/redeploy/etc. actions. The automated process essentially first changes the model, pushes the change to version control, then invokes the wrapper tool, which uploads the updated model to Glu.

frenchyan
Administrator
FYI I was able to reproduce the problem. That being said, the statement: "Glu console will stay in a freezing/locking state where no one can do anything and block deployments" does not seem to be a valid statement. 

I have had NO such problem with the console after the exception is raised (nor does it make sense after understanding what is really happening), and I can make it raise this exception fairly consistently. Can you please provide details on what you mean by this statement? A screenshot/exception would be helpful. I want to be able to reproduce this freezing/locking as well; otherwise I will not know whether I have fixed the issue.

Are you sure it is not, by any chance, the process that automatically updates the glu model that is not handling errors well?

Yan 
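
A sketch of what handling errors in the automated process could look like: run the upload command with a timeout, treat a non-zero exit code as a failure, and give up loudly after a few attempts instead of hanging. `run_with_retry` and its defaults are hypothetical; the command itself is supplied by the caller, and whether retrying a given action is safe depends on that action.

```python
import subprocess
import time


def run_with_retry(command, attempts=3, timeout_s=60.0, delay_s=5.0):
    """Run command, retrying on a non-zero exit code or a timeout.

    Returns the number of the attempt that succeeded and raises
    RuntimeError once all attempts have failed.
    """
    last_error = "not attempted"
    for attempt in range(1, attempts + 1):
        try:
            result = subprocess.run(
                command, capture_output=True, text=True, timeout=timeout_s
            )
        except subprocess.TimeoutExpired:
            # subprocess.run has already killed the child at this point.
            last_error = "timed out after %ss" % timeout_s
        else:
            if result.returncode == 0:
                return attempt
            last_error = "exit code %d: %s" % (
                result.returncode,
                result.stderr.strip(),
            )
        if attempt < attempts:
            time.sleep(delay_s)  # fixed pause before the next attempt
    raise RuntimeError(
        "command failed after %d attempts (%s)" % (attempts, last_error)
    )
```

The caller sees either a success or an exception that carries the last error, so a failed upload cannot pass silently.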

xyang
Hi Yan,

Sorry for the confusion. By locking/freezing state I meant that after clicking the 'Sign in' button on the login form, the Glu console is unable to redirect users to the dashboard page; it stays on the login page, waiting. I can see the DbCurrentSystem exception thrown in the log after observing this behavior of the console. Also, the REST call via console-cli.py to upload the model hangs and eventually times out. It feels like it is trying to grab a lock that was not released previously.

I did a grep on the logs for that exception and found that it was raised far more often than the Glu console experienced the locking/freezing issue. So I guess the condition that raises the DbCurrentSystem exception is not guaranteed to produce the locking/freezing issue on the Glu console.

The process in the end only calls out to console-cli.py and load fabric at the moment.
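
The log check described above can be sketched in Python, roughly equivalent to `grep -c`. The function name and the default search string are assumptions; the log path is whatever file the console writes to.

```python
def count_matching_lines(log_path, needle="DbCurrentSystem"):
    """Count the lines of log_path that contain needle."""
    count = 0
    # errors="replace" keeps the scan going if the log has undecodable bytes.
    with open(log_path, "r", errors="replace") as log_file:
        for line in log_file:
            if needle in line:
                count += 1
    return count
```

Comparing this count with the number of observed freezes gives the ratio mentioned above.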

frenchyan
Administrator
So here is what I am going to do:

1) I am going to release a new version of glu with what I believe is a fix for the exception (from my testing).
2) We will see whether you still get the locking/freeze issue. If/when it happens again, please issue a "kill -3 <pid>" (on the console process). This will generate a full thread dump in the console log file; otherwise it is impossible to know what is happening.

Yan
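
For automation, the same signal can be sent from Python. This is a small sketch; `request_thread_dump` is a hypothetical helper name. `kill -3` sends SIGQUIT, which a JVM answers with a thread dump while continuing to run; most other processes are terminated by it, so the pid must be that of the console's Java process.

```python
import os
import signal


def request_thread_dump(pid):
    """Send SIGQUIT (signal 3) to pid, like `kill -3 <pid>` (POSIX only).

    A JVM reacts by printing a thread dump and keeps running. The default
    action for other programs is to terminate, so only target the JVM.
    """
    os.kill(pid, signal.SIGQUIT)
```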

xyang
Thanks for your help Yan! I will collect more info when it happens next time.

frenchyan
Administrator
I just released 5.6.1 which should fix your issue.

Yan

frenchyan
Administrator
Were you able to try 5.6.1? Has it fixed your problem?

Yan

xyang
We have scheduled the update and will keep you posted.

What is the preferred way to scale Glu and manage fabrics? We have seen issues where Glu becomes really slow when the fabric is large. We tried splitting up the fabrics, and that helped to some extent. To improve the speed further, we would like to stand up more Glu instances to serve different fabrics/teams/projects, or put a load balancer in front of those Glu instances to balance the load. Can multiple Glu instances operate on the same DB? What is your take on that?

Thanks,
Xiuwen

sodul
Can you give us some pointers on what is 'large' and what parts would be slow? AFAIK managing 2000 agents in the same fabric should work. I have personally managed fabrics with over 500 entries without the console showing any signs of slowness.

Keep in mind that zookeeper can be the bottleneck with Glu. I personally recommend a setup with 3 or 5 zookeeper instances; adding more makes zk slower. You could also experiment with zookeeper 3.5.0: it is still alpha, but it has a lot of performance improvements. FYI, I have not experimented with zookeeper 3.5.0 myself, but we do use zk 3.4.6. The zk binaries that come with Glu are packaged for convenience, but you are free to bring your own binaries; the console will work perfectly.

xiuwyang
The largest fabric we have at the moment has about 2100 agents, and we use 5 zookeeper instances on version 3.4.5. We will experiment with the new version to see if there is any performance improvement.

Besides that, can two Glu consoles both communicate with the same database concurrently? Is that something you have a definite answer for, or do we have to experiment as well?

Thanks!

frenchyan
Administrator
glu has not been designed to have 2 consoles talk to the same database in write mode. It will most likely fail if you attempt it, and it is not supported.

I believe somebody had this scenario but was using the second console in read mode only: you create another user that has access to the database but give it only read access rights, so that you are sure the console is not going to write to it, and you then use this user for the second console. A scenario like this one will work. But a scenario where 2 consoles are concurrently writing will not.

Yan

sodul
I second that. You can have two glu consoles talking to the same database server as long as they use separate schemas, and I would recommend using separate credentials as well.

For example, assuming both consoles and mysql are on the same host, if you have this setting for your first console:
dataSourceUrl = "jdbc:mysql://localhost/glu"
you would want something like this on the second console:
dataSourceUrl = "jdbc:mysql://localhost/glub"