Monday, June 25, 2007

Stability & endurance tests

Alright, let's get to this week's updates. It is still too early to post any code here, but we can give some progress reports. In the last post we talked about the next tasks; I took on these:

* Internal user registration (to avoid the same user appearing on several nodes or processes) [easy] : Every user that creates a client proxy is automatically registered in the mnesia database, so duplicates can be rejected. Done for the moment, but this needs to be revisited when we get to account management. A sketch of this check follows below.

* Cluster behavior [medium] : The cluster config will require several things, but the very first one to tackle was to define a master. For this a special process called the observer was created. The observer is a registered named service, "ob", that stands by and waits for every persistent or longer-lasting process to register with it, and like in Highlander there can be only one: you cannot run a second observer on your cluster. Once it receives a registration request it writes it into a public ets table and starts watching that task asynchronously. If that task or node dies, the observer can run a stored function, which may decide whether to restart the task or do some disaster recovery; at the very least it writes a record into a special local table called monitor (the table does not get replicated, for performance reasons) listing every closed-while-monitored task, no matter why it closed. This way we will ALWAYS know what happened, even if you take down a whole node; in fact, this is why I designed a central observer, to catch crashed nodes, as local recovery was not enough for my taste. The observer is currently a single point of failure, because there is no backup for it at this time; this requires improvement. A stripped-down sketch of the observer loop also follows below.
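
For the curious, here is roughly what the registration check looks like, a minimal sketch assuming a replicated mnesia table reg_user already exists; the record, table and function names are illustrative, not necessarily what ends up in the real code:

    %% Sketch only: assumes mnesia is running and a table reg_user was
    %% created beforehand with mnesia:create_table/2.
    -record(reg_user, {name, node, pid}).

    register_user(Name) ->
        F = fun() ->
                case mnesia:read(reg_user, Name) of
                    [] ->
                        mnesia:write(#reg_user{name = Name,
                                               node = node(),
                                               pid  = self()}),
                        ok;
                    [_AlreadyThere] ->
                        {error, already_registered}
                end
            end,
        mnesia:transaction(F).

Because the check and the write happen inside one transaction, two proxies on different nodes cannot both grab the same name.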
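And here is a stripped-down sketch of the observer loop. It is only an illustration of the idea: I use erlang:monitor/2 for the asynchronous watching, assume a local (non-replicated) mnesia table called monitor and a recovery fun of arity 2, and the real "ob" also has to watch whole nodes, which is left out here:

    %% Sketch only: module, table and message names are illustrative.
    start() ->
        register(ob, spawn(fun init/0)).

    init() ->
        ets:new(ob_tasks, [named_table, public]),
        loop().

    loop() ->
        receive
            {register, Pid, Name, RecoverFun} ->
                Ref = erlang:monitor(process, Pid),
                ets:insert(ob_tasks, {Ref, Pid, Name, RecoverFun}),
                loop();
            {'DOWN', Ref, process, _Pid, Reason} ->
                case ets:lookup(ob_tasks, Ref) of
                    [{Ref, Pid, Name, RecoverFun}] ->
                        ets:delete(ob_tasks, Ref),
                        %% log the closed-while-monitored task locally
                        mnesia:dirty_write({monitor, Name, Pid, Reason}),
                        RecoverFun(Name, Reason);
                    [] ->
                        ok
                end,
                loop()
        end.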

Now this topic will be ongoing; the cluster behavior obviously does not stop here, it has barely been started. We will have to look at other topics like:
--> Server tasks (DB, Players, Simulation, NPCs, Chat, etc.)
--> Is the cluster built dynamically or statically, and does it load-balance ??
--> We need to notify when a player is between two zones, in order to get smooth character display, or else they would just suddenly pop into existence in the next zone
--> Can we resize a managed zone at run-time ?? For densely populated areas, for example; this may be rather complicated to implement as it requires continuous terrain.


TCP/IP server with authentication
I have started to develop the TCP/IP interface, which actually turned out easier than I thought. So far you can connect to it; the server sends back a challenge, which in the future will be something like MD5(random), and the client has to answer with an account name and MD5(challenge), which the server validates against the database. If the password is wrong the server disconnects; if it's correct the server checks whether the user is already online and so forth, and finally creates the client proxy, which will be the man in the middle between the TCP/IP interface and the chat client (the low-level connector). I am undecided whether to write an action plugin for the CC or a proxy that just uses the CC to reach the low-level functions. Most probably I will scrap the action plugin and replace that functionality with this client proxy; it sounds like a better idea, but it also means that per client we would have 3 processes running at the bare minimum. The advantage is that you can write whatever small program and have it call the CC, or call the CC directly from the console and monitor a certain chat channel. So many decisions, so little time :) .
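
To make the handshake concrete, here is a rough sketch of what the server side could look like. It assumes the socket is in passive mode ({active, false}) and that the client replies with term_to_binary({Account, Digest}); lookup_password/1 and start_client_proxy/2 are placeholders for the database lookup and the proxy creation described above, not the real API:

    %% Sketch only: wire format and helper names are assumptions.
    handshake(Socket) ->
        Challenge = erlang:md5(term_to_binary({node(), make_ref()})),
        ok = gen_tcp:send(Socket, Challenge),
        case gen_tcp:recv(Socket, 0, 30000) of
            {ok, Packet} ->
                {Account, Digest} = binary_to_term(Packet),
                Expected = erlang:md5([Challenge, lookup_password(Account)]),
                case Digest =:= Expected of
                    true  -> start_client_proxy(Account, Socket);
                    false -> gen_tcp:close(Socket)
                end;
            {error, _Reason} ->
                gen_tcp:close(Socket)
        end.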

Endurance test:
It seems somewhat early to do stress testing, but I find that most applications don't get off the ground because the backbone has been poorly designed, so thorough testing is of the essence. I have tried opening massive numbers of clients and designed a special routine for this. So far I can very easily create 10'000 or 100'000 clients, which open without any problem and send messages between them, so we are on the right track. If you close one of those large nodes you get a whole lot of notifications in the database, but no errors; everything is stable, even the code updates keep working, right on !!
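
The mass-client routine is essentially just this kind of loop, sketched here with cc:start/1, cc:send/3 and cc:stop/1 standing in for the real low-level connector calls, which I am only guessing at:

    %% Sketch only: spin up N dummy chat clients and let each say hello.
    spawn_clients(N) ->
        [spawn(fun() -> run_client(I) end) || I <- lists:seq(1, N)].

    run_client(I) ->
        Name = "testuser" ++ integer_to_list(I),
        {ok, Client} = cc:start(Name),
        cc:send(Client, "lobby", "hello from " ++ Name),
        client_loop(Client).

    client_loop(Client) ->
        receive
            {chat, _Channel, _From, _Text} -> client_loop(Client);
            stop -> cc:stop(Client)
        end.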

Next steps:
Next we will go for the action plugin or client proxy and start designing the simulation behavior; this is a major decision, so a clean and logical design will mean the world (quite literally). Also, a backup behavior for the observer is needed; given that all its information is stored in an ets table, a restart should be quite easy to implement, and a second master node needs to be coded. The TCP/IP server also still needs a lot of work.

Next tasks on the To-Do List:
* Create an action plug in template [easy]
* Cluster behavior [medium]
* Zone Simulation Manager [medium]
* New: backup observer and restart design
* New: mnesia fallback behavior (to have 2 master nodes, one in stand-by mode)
* TCP/IP server with authentication [medium] + protocol converter [medium]
* TCP/IP test client [medium]
* Apocalyx integration [medium]
* Master node [hard]
* NPC scripts [hard]
* Subscription Server [hard]
* Interrealm connector [hard]

Tuesday, June 12, 2007

Status: June 13th, 2007

Project Scope
The complete implementation foresees multi-domains, account administration, zone-groups, NPCs and other simulation handling, but you have to start somewhere, and that is from the ground up. In this case it means that we have to build the most basic components first to make the whole thing work.

Making it happen
Our first task here is to build a multi-server structure. There are two aspects to it: start using mnesia for self-replicating persistent data across the nodes, and build a sort of chat room server across the nodes to facilitate message passing among processes that have common interests, like for example guild chat, which obviously needs to work for members no matter on which zone and server they play. You could also have your users connect to server X, zone Mainland, but start the NPCs for that zone on server Y; given that you can connect your servers at 1Gb speed but might only have a 20Mb pipeline to the outside, speed is not the issue. This way you could also connect on server Y a service that records all the chat in a certain channel, to scan for example for spammers and foul language. The mnesia and multi-server chat is already done; next we need to do a sort of internal logon procedure and create an action plug in for the internal player/NPC proxies.

The final implementation will open a client TCP process for every client connected, but there is ALWAYS going to be an internal client proxy, which will stay alive even when the TCP client disconnects. So the proxy is part of the simulation, implementing behavior through the action plug in, while the TCP client is just the external connector. The client proxy is going to be a fixed program implementing basic behavior regarding how to plug into the chat server, but we may want to be able to put the player in control of his char or even a "mind-controlled" NPC, or disable user input for a while when a small script runs, so the actual character behavior needs to be handled by a changeable action plug in. The action plug in will determine how to handle requests: a user might send a request to move or chat to the proxy, the proxy sends the data to the action handler, which might determine that the user can move, or maybe he wants to chat, but given abusive behavior the administrator decided to mute this user for some time and the chat request is discarded. A sketch of such a proxy follows below. We can see that user handling is so much easier, but how do we start a whole zone ??
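
A minimal sketch of such a proxy could look like the following, where ActionMod:handle/2 and chat_srv:publish/3 are assumed names for the action plug in callback and the chat server call, not the actual API:

    %% Sketch only: a client proxy delegating to a swappable action module.
    start_proxy(Account, ActionMod) ->
        spawn(fun() -> proxy_loop(Account, ActionMod) end).

    proxy_loop(Account, ActionMod) ->
        receive
            {set_action, NewMod} ->
                %% e.g. hand the char over to a script or a mute handler
                proxy_loop(Account, NewMod);
            {request, From, Request} ->
                case ActionMod:handle(Account, Request) of
                    {ok, {chat, Channel, Text}} ->
                        chat_srv:publish(Channel, Account, Text);
                    {ok, Other} ->
                        From ! {result, Other};
                    deny ->
                        From ! {denied, Request}
                end,
                proxy_loop(Account, ActionMod);
            tcp_disconnected ->
                %% the proxy outlives its TCP client
                proxy_loop(Account, ActionMod)
        end.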

Zone handling
Let's give some initial thoughts on zones:
* Zones can be shared across servers with our multi chat room servers
* Some servers should be dedicated to user connections and not run NPCs or Sim events, they will be busy with filtering/handling requests and optimizing bandwidth usage
* If we assign zones to servers, how to handle load balancing ? Maybe assign fallback servers ?
* How to handle instances ? This needs to be coded right into the core structure
Thought: any zone is an instance, zones are linked instances, instances are stand-alone
* How to handle weather and global economic events ?
* How to handle NPCs that walk across regions ?
* How to handle flocks and armies ?
* When a node is down, where do we put its processes without interrupting?
The chat rooms stay the same, the sim needs to be restarted on a different node

It looks like we need a global resource manager and some cluster configuration. So the cluster master needs to do some load balancing, know the optimal configs and backup configs, and if all else fails, improvise. It will need to know the number of running processes on each node, which ones are connection nodes (for users) and which ones are sim nodes (for sim and/or NPCs). The main database node that writes to disk should not run anything else, and probably neither should the master node; it is thinkable that the subscription master node runs password authentication on local mnesia tables which do not replicate, on a separate and isolated node. A sketch of one such load-balancing decision follows below.
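
As a first idea, one of the master's load-balancing decisions could be as simple as picking the sim node with the fewest running processes. This is only a sketch; sim_nodes/0 is a hypothetical lookup of the configured sim nodes:

    %% Sketch only: a node that is down returns {badrpc, _}, which sorts
    %% after any integer, so a live node wins if one exists.
    least_loaded_sim_node() ->
        Loads = [{rpc:call(Node, erlang, system_info, [process_count]), Node}
                 || Node <- sim_nodes()],
        {_Count, Node} = lists:min(Loads),
        Node.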


Next tasks on the To-Do List:
* Create an action plug in template [easy]
* Internal user registration (to avoid same user on several nodes or processes) [easy]
* Cluster behavior [medium]
* Zone Simulation Manager [medium]
* TCP/IP server with authentication [medium] + protocol converter [medium]
* TCP/IP test client [medium]
* Apocalyx integration [medium]
* Master node [hard]
* NPC scripts [hard]
* Subscription Server [hard]
* Interrealm connector [hard]

I will go through the list by degree of difficulty (which also reflects the time required to do it).

Until next time,

Sunweaver

Welcome to the SMASH project

Hello reader.
This blog marks the beginning of a project that has the objective of producing a massively multiplayer online game written in Erlang. Actually the project started in 2005 with a lot of research, but not much code was produced until 2007.

Is it Vaporware ?
For all those wanting to know whether this is real or not: be calm, don't argue. It is vaporware ... until the day we publish code, so in the meanwhile, don't get anxious or annoyed.

When you say MMO, do you mean MO or MMO?
MO stands for multiplayer online game and the second M puts the massive into it. We hereby classify any engine that only supports some 15-20 users as an MO, and engines that can handle 1000+ users as MMO. Nowadays you see many announcements of people claiming to have built MMO engines, when truly they start to choke at 20+ connections. This is mostly due to a bad initial design, a platform that cannot handle massive numbers of requests, and the limits of the single server being used. The goal of the project is to produce an engine that can handle multiple servers and at least 1000+ users in real-time, the limit basically being only the throughput of a LAN, CPU power and the speed requirement of the application; so if a given application can live with fewer updates per second, thousands of users must be supported simultaneously. We don't want another toy, we want a serious MMO engine instead.

Why limit the project to 1000 users ?
This is not really meant as a limit, rather a minimum requirement and something achievable. But one thing is for sure: if the Smash framework can handle 1000, its uses are unlimited, because in order to produce a 2000 user framework you would only have to connect 2 networks; so assuming that 1000 works, a million is easily within our grasp, too. The current tests have shown that the framework does not even choke on 50'000 test users, although sending out 50'000 messages takes some time, so 1000 really isn't a problem to accomplish and overshoot. But we need to set a target, so 1000 is the initial target, that's all.

Time frame
There is no specific time frame, which is why we classify ourselves as vaporware. The complexity of this endeavor is quite high, so trying to extrapolate a date is too difficult at this time.

Project components
Defining the development platform is vital, so Erlang is defined as the server platform and Apocalyx for the GUI.

Why Erlang ?
* It runs on several OSes like Windows, Linux, BSD, MacOS and others
* It is concurrent at its very heart; creating threads is easy and independent of the OS
* It's soft real-time, which is what we need
* One can easily share information across threads
* The Mnesia database replicates across the connected nodes
* You can update the application during normal operation
* It can operate fault-tolerantly
To mention a few advantages of Erlang, more on http://erlang.org.

Why Apocalyx ?
Because it is easy to use through Lua scripts, it's fast and it's pretty; more on http://apocalyx.sourceforge.net.

Why this blog ?
It sometimes helps to write stuff down as you go, and often a loving comment from a reader/listener can help to improve things. So for now I will write here about progress and ideas, and if there is something you want to contribute, please message me. Note that flame posts or any unrelated posts will be deleted without question.

Bye bye,

Sunweaver