Monday, April 28, 2008

Some more remote utilities

Some might know that SMASH has an automatic updating system built into its belly, which allows us to compile all modules and reload them on all connected nodes at runtime, without ever shutting the system down. On occasion this fails and a certain service shuts down. To get a handle on this, we have created some utility routines: one permits executing commands remotely (on the other nodes) and the other shows whether any module is executing old code, so very soon we will be able to check which module on which node is running old code before a new compile is allowed. So far the problem seems to reside only in longer-lasting timers, which are not included in the update signal; all other modules appear to update correctly. With these utility routines we can now also easily start all preconfigured services (from the cluster file) remotely, without needing to touch them anymore.

Some other fixes: the CC proxy can now send tells or whispers without subscribing to any channel, and KVS can now cache and forget. When you ask kvs:cache(M,F,A,Timer) it will refresh the value every "Timer" ms and forget it after a minute, so the dictionary does not pile up with hundreds of unused values; this works like a garbage collection to keep KVS trim and slim.

More tests on Mnesia have also been run and, curiously enough, on remote nodes the dirty functions are slower than normal database access with transactions. As was to be expected, KVS suffers from the initial Mnesia access but blazes through repeated accesses afterwards, so no matter how much we try to optimize the database access, the cached KVS approach is still the best when repeated access to data is required. Dedicated MMO literature also comes to the conclusion that processes in a multi-server structure should update their data every 5 seconds; anything more frequent seems to be too much.

The cloud infrastructure has given us some new ideas as well. Erlang already has the ability to spawn remote processes, so I guess we just might implement our own flavour of it: measure the speed of each connected node, check how many processes are already running there (as a routine task that executes every "X" interval), and when asked to spawn a process pick the best node to do so, a kind of "cloud_spawn(M,F,A)". This would make SMASH ready to process any kind of task, not just a specialized simulation framework. It just might happen =) .
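
Just to make that last idea a bit more concrete, here is a minimal sketch of what such a cloud_spawn could look like, assuming for simplicity that "best node" just means "fewest running processes" (the module name and the selection criterion are only illustrative; the real thing would also fold in the speed measurements):

%% Minimal sketch: pick the connected node with the fewest processes
%% and spawn the job there. Unreachable nodes are never chosen.
-module(cloud_spawn_sketch).
-export([cloud_spawn/3]).

cloud_spawn(M, F, A) ->
    Nodes = [node() | nodes()],
    Loads = [{node_load(N), N} || N <- Nodes],
    {_Load, Best} = lists:min(Loads),
    spawn(Best, M, F, A).

%% How many processes a node is currently running.
node_load(Node) ->
    case rpc:call(Node, erlang, system_info, [process_count]) of
        Count when is_integer(Count) -> Count;
        _Error -> infinity    %% rpc failed, sort this node last
    end.

Calling cloud_spawn_sketch:cloud_spawn(mymod, run, []) would then start the job on whichever node is least busy at that moment.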

And Apocalyx can indeed use tables to store toon data: I played around with the Urban Tactics demo and it easily handles some 200 toons running around there. Very nice, now we need this networked. ;)

Laters,

Sunweaver

Friday, April 25, 2008

Does SMASH do Cloud Computing ??

I was asked the other day whether SMASH does Cloud Computing or Grid Computing [cf. Cloud Computing from Wikipedia], and the answer is that while there is still a lot of ground to cover, in essence that is what the basic framework will do. The Load Balancer (LB) will check for the best resources to allocate (this is still future music) and distribute the simulation accordingly. The basic idea is the same: if you look at the graphic on Wikipedia, the cluster table (which we will have at some point and which will define how the cluster should behave) corresponds loosely to the service catalog, the provisioning tool basically does what the LB will do, and the Monitor is in essence what the Observer does. Once these pieces fall into place we will obviously also need a Systems Management component, which might become part of the cluster behavior.

So, given that this project was started some 4 years ago, the term "Cloud Computing" did not exist at that time; it seems to have formed during late 2007. But the idea of covering massive computations by networking them across many servers, instead of using one large-scale server, is the same, so in essence I guess SMASH is a Cloud Computing framework. What's more, thanks to the unbelievable virtual machine that Erlang runs on, we can already run cross-platform: the test cluster SMASH is being programmed on consists of Windows, Linux and MacOS X systems.

Cheers,

Sunweaver

Monday, April 21, 2008

48 hours of endurance

Alright, not much to report for now, except for some minor changes on the Observer side, which did not catch all required objects. And the channel management requires a major overhaul: whisper commands currently subscribe the sender to that channel, which isn't exactly desirable, meaning that we will require authorization management, ouch. That's a major CG (chat group controller) rewrite, but anyway.

And now to the 48 hours test: yes, the server holds up. The caching Key Value Server (KVS) makes the whole thing fly and no messages pile up; things have never been this fast. It does not matter where a client connects, the SMASH framework relays the messages blazingly fast to the other nodes, so much we know by now. This concludes stability and speed tests for the time being.

Leo, if you are reading this, I sure hope that Apocalyx permits me to create each 3D object as a table entry and not just as a variable, like your examples showing variables like avatar0 - avatar6 and so on. I will need to create a LUA table, put each avatar in as one entry and then update that table. I guess once your new Gun Tactyx is out I might try to "socket" it and permit people to connect to a fighting event, we'll see; so now back to learning Apocalyx. And yes, you should be able to see some kind of graph here soon, to reflect what we have got so far.

Stay tuned,

Sunweaver
EDIT: I shortened the post.

Thursday, April 17, 2008

And there shall be speed

It seems that the new caching KVS server works miracles: the message throughput of the whole framework went through the roof, up from a maximum of 7000 messages per second (averaging 2500 at heavy load) to a minimum of 150'000 messages per second, more than a twentyfold increase, not bad. So, now we are indeed ready for the next step, Apocalyx here we come; the next thing will be to visualize our first GUI.

Sunweaver

Tuesday, April 15, 2008

Need for speed

The first message bombardment of the server is done and the news is mixed. On the one hand the server does not crash, which is nice, but after a while it starts to behave irrationally: when logging out and logging back in, it does not seem to delete users from the DB anymore, so some kind of measurement is needed. Writing some quick benchmarks, it turns out that the Chat Server (CS) can do 600'000 messages per second (mps), while the DB lookup in the chat group can only do about 6'800 mps, and with 16 clients connected we get up to 2'600 mps. I suppose the constant bombardment becomes a problem at some point because Erlang starts piling up messages, and the missing client deletion from the DB most likely has the same cause. A first conclusion is that Mnesia can't keep up with all those lookups and this causes messages to pile up like crazy.

Fortunately this was likely to happen, and now we have a legal excuse to use the KVS (Key Value Server), which has been completely dormant until now. The KVS will need to grab this kind of data from Mnesia and put it in a local table. The big difference will be this: the KVS will only load the data from Mnesia every "x" time interval, maybe every second or every 5, meaning that we lose precision in order to gain speed. We will now have to segregate our operations into those that require ultimate precision, like nearby moveable objects, banks, auction houses, trades and loot operations, and others like chat or general environment operations that do not require such precision. And we need to start with the Chat Group structure. This should be really easy to do; we will see if things speed up afterwards. A general throughput of 500'000 mps looks like our boy, but 7'000 mps does not.
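
To illustrate the idea, here is a minimal sketch of such a caching server, assuming a gen_server that reruns one expensive Mnesia lookup (given as {M,F,A}) every Interval milliseconds and answers all reads from its own copy; module and function names are only illustrative, not the actual KVS code:

-module(kvs_cache_sketch).
-behaviour(gen_server).
-export([start_link/2, lookup/0]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2,
         terminate/2, code_change/3]).

%% MFA is the expensive Mnesia lookup, Interval is the refresh period in ms.
start_link(MFA, Interval) ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, {MFA, Interval}, []).

%% Callers read the cached copy; they never touch Mnesia themselves.
lookup() ->
    gen_server:call(?MODULE, lookup).

init({{M, F, A} = MFA, Interval}) ->
    erlang:send_after(Interval, self(), refresh),
    {ok, {MFA, Interval, apply(M, F, A)}}.

handle_call(lookup, _From, {_MFA, _Interval, Cached} = State) ->
    {reply, Cached, State}.

handle_cast(_Msg, State) ->
    {noreply, State}.

%% Periodic refresh: rerun the Mnesia lookup, then rearm the timer.
handle_info(refresh, {{M, F, A} = MFA, Interval, _Stale}) ->
    erlang:send_after(Interval, self(), refresh),
    {noreply, {MFA, Interval, apply(M, F, A)}};
handle_info(_Info, State) ->
    {noreply, State}.

terminate(_Reason, _State) -> ok.
code_change(_OldVsn, State, _Extra) -> {ok, State}.

The chat group would then call lookup/0 on every message instead of hitting Mnesia, which is exactly the precision-for-speed trade described above.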

And surprise, surprise, the Playstation 3 is not nearly as powerful a server as hoped for: depending on the test it is 4-10x slower than a dual-core Windows laptop, averaging so far about 6x slower, how very disappointing. In the CS test it scores 95'000 mps, the DB lookup only gets 4'400 mps, and it takes 24 sec. to create 10'000 CC clients, while the laptop does that job in 7 sec., OUCH !!!!!

Long live the benchmarks.

Sunweaver

Wednesday, April 9, 2008

Notes on the work in progress

Today we will get into some more detail as to what the WIP list means and check what comes next:

Work in progress (WIP) of the To-Do List:

* Create an action plug-in template [easy / pending]
Currently a user logs on and a socket + chat client process are created; what is needed now is another process that will sort out what to do with the messages. The socket is the transient part of the user: if he disconnects, that process gets killed immediately, but the CC client stays around for "x" time, so a user cannot just log out when in danger and expect no harm to happen to him; this lingering CC client guarantees that he cannot cheat by quickly logging out. The plugin is required to process his requests and whatever comes to him, which includes for example:

>> Action dictionary: which actions the user may execute, like chatting, moving, joining other channels, changing clothes and so on. With that you could simply delete chat from his action list to mute the user, remove his movement ability to have him stand still, or disallow a certain spell for a while (a minimal sketch of this check follows after these points).

>> Swap the action plugin: under certain circumstances you may want to put the user under script control, for example for cut scenes, or let one user "mind control" another, so being able to switch the action plugin is helpful.

>> The ability to have several plugins at once: this way we could e.g. log all chat activity, though maybe this is not required.
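
Here is the promised minimal sketch of the action-dictionary check, assuming the plugin keeps the allowed actions in a set and simply refuses anything not in it (module, function and message shapes are illustrative, not the actual SMASH code):

-module(action_plugin_sketch).
-export([new/1, allow/2, deny/2, handle/3]).

%% The action dictionary is just a set of allowed action names.
new(AllowedActions) ->
    sets:from_list(AllowedActions).

allow(Action, Allowed) -> sets:add_element(Action, Allowed).
deny(Action, Allowed)  -> sets:del_element(Action, Allowed).  %% e.g. deny(chat, ...) mutes the user

%% Dispatch a request only if the user may perform that action.
handle({Action, _Args} = Request, Allowed, HandlerFun) ->
    case sets:is_element(Action, Allowed) of
        true  -> HandlerFun(Request);
        false -> {error, {not_allowed, Action}}
    end.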

* Internal user registration (to avoid same user on several nodes) [easy / partially done]
This is done for the CC processes, but needs to be done for the socket processes as well; this shouldn't be too hard.
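
A minimal sketch of a cluster-wide "one login per user" check, assuming the global name server is used to claim a per-user name (the name format is illustrative; SMASH may just as well use a Mnesia table for this):

-module(sox_registry_sketch).
-export([register_user/1]).

%% The first process to claim the name wins; a second login of the
%% same user, even from another node, gets rejected.
register_user(UserName) ->
    case global:register_name({sox, UserName}, self()) of
        yes -> ok;
        no  -> {error, already_logged_in}
    end.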

* Cluster behavior [medium / pending]
Here we might predefine which server does which task and what to do when that server is unavailable. This way we don't have to configure our cluster each time we start it up.
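
For illustration, such a cluster file could be a list of plain Erlang terms read with file:consult/1; the node names, service names and fallback field below are pure assumptions, not SMASH's actual format:

%% cluster.config (hypothetical contents):
%%   {node, 'sox@host1', [sox, cc],        {fallback, 'sim@host2'}}.
%%   {node, 'sim@host2', [zone_sim, kvs],  {fallback, 'sox@host1'}}.
%%   {node, 'db@host3',  [mnesia_master],  {fallback, none}}.

-module(cluster_config_sketch).
-export([load/1, services_for/2]).

load(File) ->
    {ok, Entries} = file:consult(File),
    Entries.

%% Which services a given node is supposed to run.
services_for(Node, Entries) ->
    lists:append([Services || {node, N, Services, _Fallback} <- Entries,
                              N =:= Node]).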

* Zone Simulation Manager [medium / WIP]
This part will include a chat manager for a certain zone and will extract from the database all the objects contained within that zone, both fixed and moving ones.
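
Pulling a zone's objects out of Mnesia could look roughly like this; the record definition and table name are assumptions for the sketch:

-module(zone_objects_sketch).
-export([objects_in_zone/1]).

-record(object, {id, zone, kind, position}).

%% All objects whose zone field matches ZoneId.
objects_in_zone(ZoneId) ->
    Pattern = #object{id = '_', zone = ZoneId, kind = '_', position = '_'},
    {atomic, Objects} =
        mnesia:transaction(fun() -> mnesia:match_object(Pattern) end),
    Objects.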

* TCP/IP server with authentication [medium / WIP] + protocol converter [medium / WIP]
This is basically done; what's missing is the above-mentioned database registration for logged-on users. The server already has functionality in place to block certain IPs and to avoid flooding by disallowing the port for 100 ms, hence only 10 actions per second are possible.
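
The flood throttle boils down to something like the following sketch, assuming an {active, once} gen_tcp socket that stays quiet for 100 ms after every packet (the handler fun is a placeholder):

-module(sox_throttle_sketch).
-export([loop/2]).

-define(THROTTLE_MS, 100).

loop(Socket, Handler) ->
    receive
        {tcp, Socket, Packet} ->
            Handler(Packet),
            timer:sleep(?THROTTLE_MS),                %% port stays quiet for 100 ms
            inet:setopts(Socket, [{active, once}]),   %% then accept the next packet
            loop(Socket, Handler);
        {tcp_closed, Socket} ->
            ok
    end.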

* TCP/IP test client [medium / WIP]
The first client is done: a simple text client that displays all messages and crashes *sigh* after minutes or hours because it can't handle all the display work. Next a graphical client is needed, which will be done in Apocalyx.

* Apocalyx integration [medium / pending]
The first client in Apocalyx is right ahead; this is the next thing we need to create, hopefully not too much of a mess to do.

* Master node [hard / WIP]
The ultimate goal here is to run a process on the master node that "thinks" for the rest of the infrastructure: how many processes are running on each node, how many SOX users, CC users, Sims and so on, and rebalances the equation. This will require a lot of work and will most likely be the last thing to do.

* NPC scripts [hard / pending]
This will represent a major effort to finish, given that it will have to handle scripted quest givers, movement from one place to another, and simulated AI to give the NPCs a certain life and ways of learning. Exciting but difficult, given the overall broad approach.

* Subscription Server [hard / partially done]
The first part is already implemented: each user has a validity period after which he cannot log on anymore, and you can already extend that period via the shell. But we cannot handle prolonged subscriptions yet, nor do we have a secure web-based server in place for this.
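
The validity check itself is simple; a minimal sketch, assuming the subscription end is stored as a calendar datetime (the function names are assumptions):

-module(subscription_sketch).
-export([may_log_on/1, extend/2]).

%% A user may log on as long as his expiry date lies in the future.
may_log_on(ExpiryDateTime) ->
    calendar:local_time() =< ExpiryDateTime.

%% Extend a subscription by a number of days.
extend(ExpiryDateTime, Days) ->
    Seconds = calendar:datetime_to_gregorian_seconds(ExpiryDateTime),
    calendar:gregorian_seconds_to_datetime(Seconds + Days * 86400).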

* Interrealm connector [hard / pending]
In a word: Think BIG !! Each realm might be a separate instance of the same Simulation/Game or it might be a completely different game, so user data needs to be copied over, and obviously a domain/realm browser needs to be created. Sounds like fun.

* Physique server implementation [medium / pending]
This can be done the very hard way, by duplicating all the Apocalyx logic in Erlang (sounds disgusting), or the easy way, by using Apocalyx itself. If we use Apocalyx, then we need a client (probably one per zone) that runs it, and every user will send their requests like this:

game client --> SOX --> CC --> action plugin (sorting out movements) --> physique server (sort out the ACTUAL possible moves) --> action plugin --> CC --> SOX --> game client.

In order to make this work, the Physique server needs to do a simple rendering that does not use any textures or complex figures, so we might represent all items only by their bounding boxes for collisions and represent users as boxes too. This should make the process painless and fast, and if a collision is detected then only the real movement gets reported back.
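
Reduced to bounding boxes, the check the Physique server has to do is essentially this; a minimal sketch with illustrative names, where a box is an axis-aligned {MinCorner, MaxCorner} pair:

-module(physique_sketch).
-export([check_move/3]).

%% A box is {{MinX, MinY, MinZ}, {MaxX, MaxY, MaxZ}}.
overlaps({{AX1, AY1, AZ1}, {AX2, AY2, AZ2}},
         {{BX1, BY1, BZ1}, {BX2, BY2, BZ2}}) ->
    AX1 =< BX2 andalso BX1 =< AX2 andalso
    AY1 =< BY2 andalso BY1 =< AY2 andalso
    AZ1 =< BZ2 andalso BZ1 =< AZ2.

translate({{X1, Y1, Z1}, {X2, Y2, Z2}}, {DX, DY, DZ}) ->
    {{X1 + DX, Y1 + DY, Z1 + DZ}, {X2 + DX, Y2 + DY, Z2 + DZ}}.

%% Confirm the move if nothing is hit, otherwise report the old position back.
check_move(Box, Delta, OtherBoxes) ->
    Moved = translate(Box, Delta),
    case lists:any(fun(Other) -> overlaps(Moved, Other) end, OtherBoxes) of
        false -> {moved, Moved};
        true  -> {blocked, Box}
    end.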


So the next steps are the SOX database registration to avoid double logins and a simple Apocalyx client. After that the zoning and chat room part with authorizations is due.

Warm regards,

Sunweaver

Friday, April 4, 2008

SOX now upgrades correctly

One issue in the past had been that the socket server could not be upgraded at runtime, but now it works: we can now upgrade the code of the whole system at runtime without losing a heartbeat, no disconnects, nothing, the system keeps running. What this means is that we can have our game running while you keep customizing your code, then just compile and load the code on all connected servers while players are online, and they won't even notice you changed something. This should greatly facilitate maintenance tasks and avoid any kind of downtime. And now back to the other pending items.
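
At its core, "compile + load on all connected servers" boils down to something like this sketch (the module name is illustrative; SMASH's own updater obviously does more than this):

-module(hot_upgrade_sketch).
-export([upgrade/1]).

%% Compile a module from source and load the new code on this node
%% plus every connected node, without stopping anything.
upgrade(Module) ->
    {ok, Module, Binary} = compile:file(Module, [binary]),
    Filename = atom_to_list(Module) ++ ".erl",
    rpc:multicall([node() | nodes()], code, load_binary,
                  [Module, Filename, Binary]).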

Sunweaver

Tuesday, April 1, 2008

Tuning to perform

The initial performance is being tweaked; while not bad, we feel that more can be done here regarding message passing, and some first strange errors have appeared, no showstoppers, but something to deal with.

Also under current analysis is how to proceed with zones and where players show up. Should a player appear in the next zone when he is close to it? That would mean something like overlapping zones: if a player is in the center he only reports his position into one zone, but if he runs towards a corner he could end up in up to 4 overlapping zones ... Or is it better to make the zones smaller and always send the coordinates into all surrounding zones, which would mean always transmitting into 9 zones? Zones could also be done through range lists, handling everything like one large zone. In order to determine this, Apocalyx will be required, so the next steps must already include a graphical frontend. More coming soon.
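
For the "always transmit into 9 zones" variant, the bookkeeping would be as simple as this sketch, assuming square zones of ZoneSize units on a grid and non-negative coordinates (names are illustrative):

-module(zone_sketch).
-export([zone_of/3, broadcast_zones/3]).

%% Grid cell that a position falls into (coordinates assumed non-negative).
zone_of(X, Y, ZoneSize) ->
    {trunc(X / ZoneSize), trunc(Y / ZoneSize)}.

%% The player's own zone plus its eight neighbours.
broadcast_zones(X, Y, ZoneSize) ->
    {ZX, ZY} = zone_of(X, Y, ZoneSize),
    [{ZX + DX, ZY + DY} || DX <- [-1, 0, 1], DY <- [-1, 0, 1]].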

Work in progress (WIP) of the To-Do List:
* Create an action plug-in template [easy / pending]
* Internal user registration (to avoid same user on several nodes) [easy / partially done]
* Cluster behavior [medium / pending]
* Zone Simulation Manager [medium / WIP]
* TCP/IP server with authentication [medium / WIP] + protocol converter [medium / WIP]
* TCP/IP test client [medium / WIP]
* Apocalyx integration [medium / pending]
* Master node [hard / WIP]
* NPC scripts [hard / pending]
* Subscription Server [hard / partially done]
* Interrealm connector [hard / pending]

Sunweaver