Stability & endurance tests

Alright, let's get to this week's updates. It is still too early to post any code here, but we can give some progress reports. In the last post we listed the next tasks; I took on these fellows:

* Internal user registration (to avoid the same user on several nodes or processes) [easy] : Every user that creates a client proxy is automatically registered in the mnesia database to avoid errors. Done for the moment, but this needs to be revised when we get to account management.

* Cluster behavior [medium] : The cluster config will require several things, but the very first one to tackle was defining a master. For this a special process called the observer was created. The observer is a registered named service "ob" that stands by and waits; every persistent or longer-lasting process needs to register with it. Like Highlander, there can be only one: you cannot run a second observer on your cluster. Once it receives a request, it writes it into a public ets table and starts an asynchronous link to that task. If that task or node dies, the observer can start a stored function, which may decide whether to restart the task or do some disaster recovery. At the very least, it writes to a special local table called monitor (the table is not replicated, for performance reasons) to record all closed-while-monitored tasks, no matter why they closed. This way we will ALWAYS know what happened, even if you close a whole node. This is actually why I designed a central observer: to catch crashed nodes, since local recovery was not enough for my taste. The observer is currently a single point of failure, because there is no backup for it at this time; this requires improvement.
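
To make the observer idea a bit more concrete, here is a rough Python sketch. It is not the actual Erlang code (which is not published yet), just an illustration under some assumptions: the names `Observer`, `register`, `task_down`, `node_down` and the `on_down` callback are invented for this example, a plain dict stands in for the public ets table, and a list stands in for the local monitor table.

```python
class ObserverAlreadyRunning(Exception):
    """Raised when a second observer is started on the same cluster."""


class Observer:
    _instance = None  # stands in for the cluster-wide registered name "ob"

    def __init__(self):
        if Observer._instance is not None:
            raise ObserverAlreadyRunning("there can be only one observer")
        Observer._instance = self
        self.registry = {}     # plays the role of the public ets table
        self.monitor_log = []  # plays the role of the local "monitor" table

    def register(self, task_id, node, on_down=None):
        """Record a long-lived task and the stored function to run if it dies."""
        self.registry[task_id] = {"node": node, "on_down": on_down}

    def task_down(self, task_id, reason):
        """Handle the death of a monitored task: always log, then try recovery."""
        entry = self.registry.pop(task_id, None)
        if entry is None:
            return None  # task was not monitored, nothing to record
        self.monitor_log.append((task_id, entry["node"], reason))
        if entry["on_down"] is not None:
            return entry["on_down"](task_id, reason)
        return None

    def node_down(self, node):
        """A whole node vanished: treat every task on it as closed."""
        dead = [t for t, e in self.registry.items() if e["node"] == node]
        for task_id in dead:
            self.task_down(task_id, "node_down")
        return dead


# Usage: two tasks on one node, then the node goes away.
ob = Observer()
ob.register("chat_1", "node_a", on_down=lambda task, reason: "restart")
ob.register("db_1", "node_a")
ob.node_down("node_a")
```

After `node_down`, both tasks are gone from the registry and show up in `monitor_log`, whether or not a recovery function was stored for them. Note that the sketch shares the weakness described above: if the observer itself dies, nothing takes over.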

This topic will be ongoing. The cluster behavior obviously does not stop here, it has barely started; we will have to look at other topics like:
--> Server tasks (DB, Players, Simulation, NPCs, Chat, etc.)
--> Is a cluster built dynamically or statically, and does it load-balance?
--> We need to notify when a player is between two zones in order to get smooth character display, or else they would just suddenly pop into existence in the next zone.
--> Can we resize the managed zone at run-time? For densely populated areas, for example. This may be rather complicated to implement, as it requires continuous terrains.

TCP/IP server with authentication
I have started to develop the TCP/IP interface, which actually went easier than I thought. So far you can connect to it and the server sends back a challenge, which in the future will be an MD5(random) sort of function. The client will have to answer with an account name and MD5(challenge), which the server will validate against the database. If the password is wrong, the server disconnects; if it's correct, the server will check if the user is already online and so forth, and finally create the client proxy, which will be the man in the middle between the TCP/IP interface and the chat client (the low-level connector). I am undecided whether to write an action plugin for the CC or a proxy that just uses the CC to connect to the low-level functions. Most probably I will scrap the action plugin and replace that functionality with this client proxy. It sounds like a better idea, but it also means that we would have 3 processes running per client at the bare minimum. The advantage is that you can write whatever small program and have it call the CC, or call the CC directly from the console and monitor a certain chat channel. So many decisions, so little time :)
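
The handshake could look roughly like the Python sketch below. This is only an illustration, not the server code. One assumption matters a lot here: the post only says the client answers with MD5(challenge) and that the password is validated, so the sketch assumes the answer is MD5 over the challenge concatenated with the password. The names `ACCOUNTS`, `ONLINE`, `make_challenge`, `client_response` and `server_verify` are invented, and the in-memory account store stands in for the real database.

```python
import hashlib
import hmac
import os

# Hypothetical account store; the real server would look this up in the database.
ACCOUNTS = {"alice": "secret"}
ONLINE = set()  # accounts that already have a client proxy


def make_challenge():
    """Server side: MD5 over random bytes, sent to the client on connect."""
    return hashlib.md5(os.urandom(16)).hexdigest()


def client_response(challenge, password):
    """Client side. ASSUMPTION: the answer mixes challenge and password."""
    return hashlib.md5((challenge + password).encode("utf-8")).hexdigest()


def server_verify(account, challenge, response):
    """Return 'ok', 'disconnect' (bad credentials) or 'already_online'."""
    password = ACCOUNTS.get(account)
    if password is None:
        return "disconnect"
    expected = client_response(challenge, password)
    # Constant-time comparison of the two hex digests.
    if not hmac.compare_digest(expected, response):
        return "disconnect"
    if account in ONLINE:
        return "already_online"
    ONLINE.add(account)  # this is where the client proxy would be created
    return "ok"
```

Because the challenge is random per connection, a recorded answer cannot simply be replayed on the next login. What the server should do with an already-online user is left open here, just as it is in the plan above.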

Endurance test:
It seems somewhat early to do stress testing, but I find that most applications don't get off the ground because the backbone has been poorly designed, so thorough testing is of the essence. I have tried opening massive numbers of clients and designed a special routine for this. So far I can very easily create 10'000 or 100'000 clients, which open without any problem and send messages between them. We are on the right track: if you close one of those large nodes you get a whole lot of notifications in the database, but no errors. Everything is stable, even the code updates keep working, right on!!
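
For readers who want to picture what such a mass-client routine does, here is a small Python sketch of the same kind of test. It is not the routine used here: asyncio tasks stand in for the real client processes, each client only sends a single message to its neighbour in a ring, and the names `client`, `run_clients` and `endurance_test` are invented for the example.

```python
import asyncio


async def client(my_id, inboxes, received):
    """Send one message to the next client in the ring, then wait for one."""
    target = (my_id + 1) % len(inboxes)
    await inboxes[target].put(("hello", my_id))
    _msg, sender = await inboxes[my_id].get()
    received[my_id] = sender


async def run_clients(n):
    """Start n clients at once and return who each one heard from."""
    inboxes = [asyncio.Queue() for _ in range(n)]
    received = {}
    await asyncio.gather(*(client(i, inboxes, received) for i in range(n)))
    return received


def endurance_test(n):
    """True if every client received the message of its predecessor."""
    received = asyncio.run(run_clients(n))
    return all(received[i] == (i - 1) % n for i in range(n))
```

Calling `endurance_test(10000)` starts ten thousand clients in one go. The interesting part of a real endurance test is of course what happens when nodes are closed while the clients are running, which this sketch does not cover.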

Next steps:
Next we will go for the action plugin or client proxy and start designing the simulation behavior. This is a major decision, so a clean and logical design will mean the world (quite literally). Also, a backup behavior for the observer is needed; given that all its information is stored in an ets table, a restart should be quite easy to implement, and a second master node needs to be coded. The TCP/IP server also still needs a lot of work.

Next tasks on the To-Do List:
* Create an action plugin template [easy]
* Cluster behavior [medium]
* Zone Simulation Manager [medium]
* New: backup observer and restart design
* New: mnesia fallback behavior (to have 2 master nodes, one in stand-by mode)
* TCP/IP server with authentication [medium] + protocol converter [medium]
* TCP/IP test client [medium]
* Apocalyx integration [medium]
* Master node [hard]
* NPC scripts [hard]
* Subscription Server [hard]
* Interrealm connector [hard]