About fabric and agent issue


About fabric and agent issue

Mr_xushuai
<pre>
Hi Yan,

Greetings!

I currently have 400+ fabrics and 5000+ hosts. These are my steps:
    1. Configure the glu-meta-model.json.groovy file (this is very complex; how can this step be simplified?)
    2. Generate the distributions [-D]
    3. Configure the ZooKeeper clusters [-Z]
    4. Start the agent with agentctl.sh -f XXX

The agent starts and is assigned to the fabric without any problem.

But now I need to make changes, such as adding a fabric. My steps are:
1. Change the glu-meta-model.json.groovy configuration file
2. Generate the distributions [-D]
3. Configure the ZooKeeper clusters [-Z]
4. Create the fabric in the console UI; only then is the fabric really available.
5. Start the agent with agentctl.sh -f XXX (adding an agent to the fabric is also a lot of trouble).


I have already wrapped your REST API. Although it provides a call to create a fabric, that is only equivalent to step 4 above. Deleting an agent through REST is also just the "clean" function of the console; it does not really unassign the agent from the fabric so that it can be assigned to another fabric.

I now need to add and remove fabrics often, and agents also need to be added and deleted frequently. How can I keep the data synchronized? Do you have any good ideas?
</pre>

Re: About fabric and agent issue

frenchyan
Administrator
Wow this is certainly quite a big setup.

I am not entirely sure what you are doing, but although glu can handle it, it seems to be a rather large number of fabrics. In general fabrics are used more to represent a datacenter and are fairly "static". Although it is possible to add and/or remove fabrics, the intent was mostly to have just a few of them that don't change very often.

The main issue is that the fabric is the top-level grouping in glu at which orchestration happens. glu does not cross fabric boundaries when you generate deployment plans or things like this. Also, as I am sure you have noticed, when you look at the dashboard, you always need to look at a given fabric... and the dashboard is supposed to be your quick overview of your entire system. Likewise, when you issue a REST call, you need to provide the fabric as a key. With the numbers you give, on average you have about 12 agents per fabric, so with this setup it is very hard for you to get a full picture of the state of your system rapidly... in the UI you would have to look at 400 different fabrics, which is impractical. With the REST API you will have to issue 400 different calls to get the state of each fabric to know the state of your system.

When I was at LinkedIn, we essentially had fabrics like:

* prod_chicago
* prod_los_angeles
* staging_san_francisco

each fabric would have hundreds (for staging) or thousands (for prod) of agents.

You also have to remember that you need 1 model per fabric, and maybe this is why you are struggling. The way you set up your environment makes it very hard to manage because you need to deal with everything at the fabric level and you have way too many fabrics, which leads to way too many models, which leads to an unmanageable UI or REST calls.

If you would tell me what you have been using fabrics for, maybe I can tell you how to re-think your model to not do it this way.

Yan


Re: About fabric and agent issue

Mr_xushuai
Thank you for your quick reply!
We have 20+ main businesses, 20*5 sub-businesses, and 400+ business pools, so we made every business pool a fabric (the business pool name equals the fabric name). 5000+ machines are assigned to the businesses under the pools, but the 500+ machines belong to 10 different machine rooms.

For example, a business pool may have 20 hosts from 5 different machine rooms. In the image below you can see the basic hierarchy. Our release process is:
1. Select a business pool
   Handled in our system
2. Select one or more IDCs (multithreaded)
   multithreading: Handled in our system
   filter: Handled in glu
3. Choose the release group (A or B; by default group A is deployed first, then group B)
   Handled in your plan (filter)
4. Select the number of machines to deploy concurrently inside an IDC
   Handled in your plan
5. Enter a version number
6. Generate the static model

Sometimes a deployment is just an svn update to a given version, so our static model needs the version number. The static model is generated automatically (including initParameters). We also need to collect the glu script logs; currently we use a command (shell.exec("curl -v -Data logs=debug_info http://...")) to send a REST request to our system.

Re: About fabric and agent issue

frenchyan
Administrator
I am not sure I am grasping all the concepts you have, but let me give you some pointers/ideas.

* glu is a deployment automation platform. The word "platform" is very important because it is expected that you will build your own solution on top of glu. What this means is that you will need to write some code that will interact with glu (usually through the REST API).
* in general, you need to model your system (and it seems from the diagram you posted and the concepts you have that you have already done so). You should model your system the way it makes sense to you/your company, not the way it makes sense to glu. What is very important is to have a model, usually defined as a set of files/directory structure in whatever format you want (xml, json, yaml, etc...), and it is highly recommended that it NOT be the glu json format. It is also highly recommended that this set of files is under version control.
* you then have a tool/build process that, based on your model, will generate one (or more, as in one per fabric) glu json model which will then be uploaded to glu
* glu accepts the .json as well as the .json.groovy format. The purpose of .json.groovy is, for simple cases, to have the build tool embedded in the model so that you don't have to maintain both a model and a build tool. It seems from your setup that you probably need a separate build tool.
* the glu static model (not the meta model) allows you to use tags and/or metadata per entry (think gmail labels). This is what you can use to filter your view/operation to only act on portions of the model if you do not want to act on the full model. For example, you could have metadata.business, metadata.sub-business, metadata.group (I see you have groupA and groupB), then you can do any action with filters like metadata.business='business1' and metadata.group="groupA", etc...
* like I said, the fabric is the top level and there is no way to go across fabrics. If you use a few static fabrics with tags and/or metadata, then it is much easier to generate fewer models and manage them. Even if you have a big model with hundreds of entries, if only one changes, glu will not touch what has not changed (the delta computation takes care of this).
* again, in general the glu meta model / distribution flow is really meant to generate your distribution packages the first time around and should not be necessary afterwards: the agent package should always be the same no matter how many hosts you have.

So my recommendation based on the previous section is to try to rethink your setup so that you have a few static fabrics and use the filtering capabilities of glu instead (tags and/or metadata). Then model your system the way you want and have a little tool that generates the input for glu.
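A minimal sketch of such a tool, in Python, assuming a hypothetical in-house inventory (a list of hosts with pool, IDC, and release group). It emits one static model for a single fabric and records the pool, IDC, and group as per-entry metadata so they can be used for filtering. The fabric name, mount point layout, script URL, and metadata keys are illustrative assumptions, and the JSON field names used here (fabric, entries, agent, mountPoint, script, initParameters, metadata, tags) should be checked against the glu static model documentation before use.

```python
import json

# Hypothetical in-house inventory; in practice this would be read from
# your own version-controlled model files.
HOSTS = [
    {"host": "host-001", "pool": "pool-a", "idc": "idc-1", "group": "A"},
    {"host": "host-002", "pool": "pool-a", "idc": "idc-2", "group": "B"},
    {"host": "host-003", "pool": "pool-b", "idc": "idc-1", "group": "A"},
]


def build_static_model(fabric, hosts, script_uri, version):
    """Build one static model (as a dict) for a single fabric."""
    entries = []
    for h in hosts:
        entries.append({
            "agent": h["host"],
            "mountPoint": "/" + h["pool"],  # assumed layout: one mount point per pool
            "script": script_uri,
            "initParameters": {"version": version},
            "metadata": {"pool": h["pool"], "idc": h["idc"], "group": h["group"]},
            "tags": [h["pool"], h["idc"]],
        })
    return {"fabric": fabric, "entries": entries}


def select_entries(model, **wanted):
    """Local stand-in for a metadata filter: keep entries whose metadata matches."""
    return [e for e in model["entries"]
            if all(e["metadata"].get(k) == v for k, v in wanted.items())]


# Placeholder fabric name and script URL, for illustration only.
model = build_static_model(
    "prod_idc", HOSTS, "http://repo.example.com/scripts/WebappScript.groovy", "1.0.0")
group_a = select_entries(model, pool="pool-a", group="A")
print(json.dumps(model, indent=2))
```

With this layout, adding a pool or a host is a change to the inventory and a regenerated model rather than a new fabric, and a release of "pool-a, group A" becomes a filter on metadata instead of a choice of fabric.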

I guess another question is whether you need to add/remove hosts dynamically or very often. Or do you have a pool of hosts with glu installed on them that you can simply reuse?

Yan

Re: About fabric and agent issue

Mr_xushuai
We really appreciate your advice!

* We generate the static model JSON file when we read the pool data, then load it through your REST API.


* To create a fabric, we currently regenerate the glu-meta-model.json.groovy file, upload it to $GLU_HOME/models, then execute the following commands:
$GLU_HOME/bin/setup.sh -D -o dists models/glu-meta-model.json.groovy
$GLU_HOME/bin/setup.sh -Z -o dists models/glu-meta-model.json.groovy
This step is performed by a glu agent; we wrote a glu script for it.

I think your advice is perfect, but our project is already built the way described above. Following your advice would require a dramatic change, so we will do it in the next version.
What I want to ask is: we will create nearly 500 fabrics. Do you think this will have any impact on the system, besides being hard to maintain as you mentioned? We will also load static model JSON files frequently. Do you have any suggestions?


There is an urgent problem: in a glu script I invoke rootShell.exec to run "wget -c --ftp-user=pplive --ftp-password='XXXXX' ftp://yum.XXX.com....." to download some files. The operation is asynchronous, which I do not want, because the script needs to use the downloaded file immediately (for example with rootShell.ls...). How can I solve this problem? Maybe with the shell.waitFor method?


Re: About fabric and agent issue

frenchyan
Administrator
I am not really expecting anything to break or slow down due to the high number of fabrics: each model will be a lot smaller, so it should be pretty responsive. I think it is mostly the management on your side that is going to be a nightmare. But I have never really tested it. If anybody on this forum has tried it, please share your experience.

Yan

Re: About fabric and agent issue

Mr_xushuai
There is an urgent problem: in a glu script I invoke rootShell.exec to run "wget -c --ftp-user=pplive --ftp-password='XXXXX' ftp://yum.XXX.com....." to download some files. The operation is asynchronous, which I do not want, because the script needs to use the downloaded file immediately (for example with rootShell.ls...). How can I solve this problem? Maybe with the shell.waitFor method?

Re: About fabric and agent issue

frenchyan
Administrator
First of all I am not sure why your command is asynchronous. Is it asynchronous if you run it directly in the cli? If it is not asynchronous when you run it in a terminal, it should not be with rootShell.exec. How are you invoking it?

Have you tried shell.fetch instead since it should let you download without having to worry about how to do it? (it expects a uri, which can be ftp, as well as userinfo for username and password)

shell.fetch("ftp://pplive:XXX@.../...", destination)

Yan



On Mon, Sep 22, 2014 at 3:30 PM, Mr_xushuai [via glu] <[hidden email]> wrote:
There is an urgent problem: in a glu script I invoke rootShell.exec to run "wget -c --ftp-user=pplive --ftp-password='XXXXX' ftp://yum.XXX.com....." to download some files. The operation is asynchronous, which I do not want, because the script needs to use the downloaded file immediately (for example with rootShell.ls...). How can I solve this problem? Maybe with the shell.waitFor method?




Re: About fabric and agent issue

Mr_xushuai
Sincere thanks to you for your reply!

I want to use the asterisk (*) wildcard in the shell API for pattern matching, but I can't get it to work.

Example 1:
I want to download several war files:
shell.fetch("ftp://pplive:XXX@yum.console.com/testDir/*.war.*", destination)

Example 2:
I want to remove several files:
shell.rm("*.war");
rootShell.rmdirs("*.version*");

Example 3:
I want to copy several files (not directories):
rootShell.cp("/a/b/*","c");

Example 4: same as above
rootShell.exec("cd c;cp /a/b/* ./");//Exception: no such directory

But this does not seem to be allowed. Do you have any suggestions?

Re: About fabric and agent issue

frenchyan
Administrator
I am not entirely sure how shell.fetch could support "*". There is no such support with ftp/http etc...


something like:

  include(name: "*.version*")
}.each { dir -> shell.rmdirs(dir) }

shell.ls("/a/b").each { f -> shell.cp(f, "c") }

shell.exec("xxx") simply forks to execute the process so I would expect that "*" is supported.

have you tried:

rootShell.exec(pwd: "c", command: "cp /a/b/* .")

although in this case you can probably use the looping call I mentioned above
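The "list, then loop" idea can be illustrated outside glu. The following is a minimal Python sketch, not glu API: a wildcard such as "/a/b/*" only means something if a shell or the program itself expands it, so a copy routine that receives the pattern as a literal string has nothing to match. The helper name and the paths are hypothetical, and the sketch makes no claim about which glu shell methods expand wildcards.

```python
import fnmatch
import os
import shutil


def copy_matching(src_dir, pattern, dest_dir):
    """Copy regular files in src_dir whose names match pattern into dest_dir."""
    os.makedirs(dest_dir, exist_ok=True)
    copied = []
    for name in sorted(os.listdir(src_dir)):
        path = os.path.join(src_dir, name)
        # Skip directories: only plain files are copied.
        if os.path.isfile(path) and fnmatch.fnmatch(name, pattern):
            shutil.copy(path, dest_dir)
            copied.append(name)
    return copied
```

For example, copy_matching("/a/b", "*", "c") would copy every plain file from /a/b into c, which is the same shape as listing the directory and calling a copy per file.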

Yan

Re: About fabric and agent issue

Mr_xushuai
I created 500+ fabrics, and the console is very slow.

My console keeps printing the following error:

2014/09/25 10:36:52.490 INFO [ClientCnxn] Opening socket connection to server localhost.idc.pplive.cn/10.208.10.66:2181. Will not attempt to authenticate using SASL (unknown error)
2014/09/25 10:36:52.490 INFO [ClientCnxn] Socket connection established to localhost.idc.pplive.cn/10.208.10.66:2181, initiating session
2014/09/25 10:36:52.490 WARN [ClientCnxn] Session 0x0 for server localhost.idc.pplive.cn/10.208.10.66:2181, unexpected error, closing socket connection and attempting reconnect
java.io.IOException: Connection reset by peer
        at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:68)
        at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:355)
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1068)


All agents log the following message at startup:
2014/09/23 00:02:42.003 INFO [ClientCnxn] Opening socket connection to server 10.208.10.66/10.208.10.66:2181. Will not attempt to authenticate using SASL (unknown error)

But everything seems to be normal.

Please check the attachment (glu-meta-model.groovy).

Another question:
Instead of using the ZooKeeper that ships with glu, can glu be integrated with our own ZooKeeper?
I have tried it but ran into some problems. If this is possible, can you give me any helpful suggestions?

Re: About fabric and agent issue

frenchyan
Administrator
You should try with a smaller set of fabrics and see if you still get the problem. One issue I can think of is that although every fabric uses the same ZooKeeper, the console will open a connection to ZooKeeper for each of them. It is unclear whether that is the issue or not. But you should try to reduce the number of fabrics (you can do exponential back off: 250, 125, etc...). It will be a lot easier to troubleshoot on a smaller system.
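One way to organize that search, as a sketch: halve the fabric count while the problem still reproduces, then narrow down between the last good and the last bad count. Here `reproduces` is a hypothetical callback (for example, a manual check after redeploying with n fabrics), and the sketch assumes the problem is monotonic in the number of fabrics, which may not hold in practice.

```python
def smallest_failing_count(max_count, reproduces):
    """Return the smallest n in [1, max_count] for which reproduces(n) is True.

    Assumes reproduces(max_count) is True and that once the problem appears
    at some count it also appears at every larger count.
    """
    lo, hi = 1, max_count  # invariant: the answer lies in [lo, hi]
    while lo < hi:
        mid = (lo + hi) // 2
        if reproduces(mid):
            hi = mid       # still failing: try fewer fabrics
        else:
            lo = mid + 1   # healthy: the threshold is higher
    return lo
```

Starting from 500 fabrics, the first probes are 250 and 125, matching the sequence suggested above.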

Yan

Re: About fabric and agent issue

Mr_xushuai
I have tried with 9 fabrics, but I still have this problem (when I log in to the console).



Another question:
Instead of using the ZooKeeper that ships with glu, can glu be integrated with our own ZooKeeper?
I have tried it but ran into some problems. If this is possible, can you give me any helpful suggestions?


This is the entire log (console.log):
...
localhost.idc.pplive.cn/10.208.10.66:2181. Will not attempt to authenticate using SASL (unknown error)
2014/09/25 11:42:47.239 INFO [ClientCnxn] Socket connection established to localhost.idc.pplive.cn/10.208.10.66:2181, initiating session
2014/09/25 11:42:47.243 INFO [ClientCnxn] Session establishment complete on server localhost.idc.pplive.cn/10.208.10.66:2181, sessionid = 0x148aae67fb30008, negotiated timeout = 30000
2014/09/25 11:42:47.247 INFO [FabricServiceImpl] Loaded fabrics: [fabric-1, pseg-ctr-jk, ops_adp_pool, osp-oth-er, osp-oth-php, osp-oth-rn, naty_01, naty_02, naty_03]

Re: About fabric and agent issue

frenchyan
Administrator
I don't know if it could be an issue or not (in theory it should just work), but in order to troubleshoot your issue, you should try with the ZooKeeper that comes with glu and see if it goes away.

Yan

Re: About fabric and agent issue

Mr_xushuai
Hi Yan,
I have tried with the ZooKeeper that comes with glu, and so far it has worked. But I am confused about some things; please check the image attachment: zookeeper.jpg

I am calling the console REST API to create fabrics. When creating about the sixtieth fabric, it throws this exception:

2014/09/26 13:43:08.770 INFO [ClientCnxn] Opening socket connection to server localhost.idc.pplive.cn/10.208.10.53:2181. Will not attempt to authenticate using SASL (unknown error)
2014/09/26 13:43:08.770 INFO [ClientCnxn] Socket connection established to localhost.idc.pplive.cn/10.208.10.53:2181, initiating session
2014/09/26 13:43:08.771 WARN [ClientCnxn] Session 0x0 for server localhost.idc.pplive.cn/10.208.10.53:2181, unexpected error, closing socket connection and attempting reconnect
java.io.IOException: Connection reset by peer
        at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:68)
        at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:355)
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1068)

Around the sixtieth fabric, the REST API call becomes extremely slow: at first it takes only a few seconds, but at this point it takes about 1 minute.
Logging in to the console is also extremely slow at this point. Can you tell me what I need to optimize? Perhaps the Jetty configuration or a meta model parameter? This has some impact on our project since, as you know, we have 500+ fabrics.

In addition, the console continuously prints this log:
2014/09/26 14:28:05.011 INFO [ClientCnxn] Unable to read additional data from server sessionid 0x0, likely server has closed socket, closing socket connection and attempting reconnect
2014/09/26 14:28:05.026 INFO [ClientCnxn] Opening socket connection to server localhost.idc.pplive.cn/10.208.10.53:2181. Will not attempt to authenticate using SASL (unknown error)
2014/09/26 14:28:05.026 INFO [ClientCnxn] Socket connection established to localhost.idc.pplive.cn/10.208.10.53:2181, initiating session
2014/09/26 14:28:05.027 INFO [ClientCnxn] Unable to read additional data from server sessionid 0x0, likely server has closed socket, closing socket connection and attempting reconnect
2014/09/26 14:28:05.464 INFO [ClientCnxn] Opening socket connection to server localhost.idc.pplive.cn/10.208.10.53:2181. Will not attempt to authenticate using SASL (unknown error)
2014/09/26 14:28:05.465 INFO [ClientCnxn] Socket connection established to localhost.idc.pplive.cn/10.208.10.53:2181, initiating session
2014/09/26 14:28:05.465 INFO [ClientCnxn] Unable to read additional data from server sessionid 0x0, likely server has closed socket, closing socket connection and attempting reconnect
2014/09/26 14:28:05.561 INFO [ClientCnxn] Opening socket connection to server localhost.idc.pplive.cn/10.208.10.53:2181. Will not attempt to authenticate using SASL (unknown error)
2014/09/26 14:28:05.561 INFO [ClientCnxn] Socket connection established to localhost.idc.pplive.cn/10.208.10.53:2181, initiating session

Re: About fabric and agent issue

frenchyan
Administrator
I think one of the main issues is that, as I mentioned in the past, fabrics have not been designed to be used the way you are using them. Because fabrics are tied to ZooKeeper (all hosts in a fabric share a ZooKeeper) and because ZooKeeper does not really span data centers/networks, really there should be a fabric per ZooKeeper cluster.

Because of this, for each fabric, the console connects to ZooKeeper, which again has been designed with the idea that it would be a different ZooKeeper. In your case, you are using the same ZooKeeper for all your fabrics, and the console is connecting to ZooKeeper for each of them even if it is the exact same ZooKeeper (because again, in the design it was never meant to be used that way).

In your case you end up opening 500 connections with the same ZooKeeper and I have no clue whether that is ok or not (as far as I can tell ZooKeeper clearly handles a huge number of clients since every agent is a client...).


You may want to tinker with it and open only one connection for all your fabrics and see if it solves your issues. I am not too sure what else to suggest at this stage.

Yan

Re: About fabric and agent issue

Mr_xushuai
Hi Yan,

The following is my ZooKeeper connection string:
zk1.zookeeper.idc.pplive.cn:2181,zk2.zookeeper.idc.pplive.cn:2181,zk3.zookeeper.idc.pplive.cn:2181,zk4.zookeeper.idc.pplive.cn:2181,zk5.zookeeper.idc.pplive.cn:2181

In fact, I have configured a cluster of five ZooKeeper servers. I have a question: you say that ZooKeeper handles a huge number of clients since every agent is a client. Do you mean that with 5000+ agents there will be 5000+ agent connections to ZooKeeper, rather than 500+ fabric connections?

In other words, each of our ZooKeeper servers allows a maximum of 500 connections, so a cluster of five allows 2500 connections. Should I configure two clusters and divide the 500+ fabrics into two groups?
Another question:
Why can't I see, with our own UI tools, the data written to our ZooKeeper (not the glu ZooKeeper)? Do I need to change your code? Maybe you can check the picture: zookeeper-data.jpg


Our project has reached the final testing phase. I'm sorry to ask you so many questions. I am truly and sincerely obliged to you for your help!

Best regards!

Re: About fabric and agent issue

frenchyan
Administrator
First of all I am not an expert in ZooKeeper and whatever I say may or may not be true (or may have been true in the past). 

What I can tell you from glu's perspective:

* each agent belongs to 1 and only 1 fabric and connects to the ZooKeeper cluster (as defined by your connection string). So if you have 5000 agents, you will have 5000 connections (to the cluster).
* each agent reads its configuration from ZooKeeper when it boots (if configured that way; this can be changed), but once started, the agent never reads again, only writes to ZooKeeper, and does not have any "watcher"
* the console establishes 1 connection per fabric to the ZooKeeper cluster hosting this fabric; it mostly reads and sets watches to be notified when agents write to ZooKeeper. So if you have 500 fabrics hosted by the same ZooKeeper cluster, then the console will establish 500 connections to the same ZooKeeper cluster.

The fact that the console connects vs the agent connects should not make any difference, meaning that if you had 5500 agents and 1 fabric you would have 5501 connections to the ZooKeeper cluster as a whole, which is essentially the same as with 5000 agents and 500 fabrics (5500 connections to the cluster).

ZooKeeper runs as a cluster, and it is highly recommended to use at least 3 nodes and always an odd number (3, 5, ...). You have 5, which seems like a good setup. Every time an entity connects to the cluster (agent or console), given the same connection string, it will pick a node at random from the pool... so on average each node in the cluster should have about 1/5th of your connections.
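The arithmetic above can be written down as a small sketch. It only counts connections under the assumptions stated in this post (one connection per agent, one console connection per fabric, connections spread evenly over the nodes). The per-node console figure is included because all console connections originate from a single host: ZooKeeper's `maxClientCnxns` setting limits concurrent connections from one client IP to one server (its documented default in recent releases is 60), so it may be worth comparing that figure against the value in your own zoo.cfg. Whether that limit plays any role in the problem discussed here is not established in this thread.

```python
def connection_estimate(agents, fabrics, zk_nodes):
    """Estimate ZooKeeper connections for a glu setup.

    Assumes one connection per agent and one console connection per fabric,
    spread evenly across the ZooKeeper nodes.
    """
    total = agents + fabrics
    return {
        "total": total,
        "per_node": total / zk_nodes,
        # All console connections come from the console host's single IP.
        "console_per_node": fabrics / zk_nodes,
    }


few_fabrics = connection_estimate(agents=5500, fabrics=1, zk_nodes=5)
many_fabrics = connection_estimate(agents=5000, fabrics=500, zk_nodes=5)
print(few_fabrics["total"], many_fabrics["total"])  # 5501 5500
print(many_fabrics["console_per_node"])  # 100.0
```

The totals are nearly identical, but with 500 fabrics the console alone averages about 100 connections per node from one IP, whereas each agent contributes a single connection from its own IP.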

But this is where my knowledge of ZooKeeper stops. I could not tell you whether a bigger (or smaller, like 3) cluster would help or not. I also do not know whether having the console connect only 1 time to the cluster (vs 500) would make a difference in the grand scheme of things (since it comes down to my previous example: 5000 + 500 vs 5500 + 1). The boot time, though, should be significantly improved, as the console would not need to actually establish 500 connections and wait for them to all be alive in order to boot. But once boot is done, it should behave properly.

So in the end I am still pretty unclear about what the actual problem is. You mentioned it is slow, but is it slow all the time? Is it slow at boot? You also mentioned some warnings in the log. Is that the problem? Does glu behave abnormally?

Short of trying to have a system as intended (which is a handful of fabrics), I would suggest 2 things:

* try to check with the ZooKeeper community about your issue and see what people have to say, especially about the warning (also I am pretty sure you can find paid support for ZooKeeper... if you use it for other reasons in your company you probably already have support), and what they recommend in terms of the size of the cluster vs the number of connections to it (also, is ZooKeeper configured properly? Maybe there are some errors in the ZooKeeper log files...)
* try to modify the code I pointed out to connect to ZooKeeper only once in the console and see if that makes a difference.

Yan



Re: About fabric and agent issue

Mr_xushuai
I think I understand what you said. I guess the number of agents was very large when you worked at LinkedIn, so the number of ZooKeeper connections must also have been very large, even with only 3 fabrics. How did you handle it?

At LinkedIn, did you need the ZooKeeper data to be displayable in a UI tool, or was it enough that glu itself could read the data? So far I have had no problems with my glu using our own ZooKeeper.