Friday, April 24, 2009

Replicate this !!!

Before we start: I installed the SMASH framework on a Linux machine, and boy are those sockets fast. The message sync error mentioned in the post below does not occur there, curiously, and latency improves by roughly 3:1 over Windows sockets. I am impressed.

But that is not the main topic of today. I have gone over the edge and am currently programming a second Key Value Server (KVS2). The original KVS is a key-value (or name) server that caches function call results for a given amount of time, recaching every n seconds. KVS2 is different: it is designed to be a database replacement, i.e. a mnesia replacement.
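To give a flavor of the original KVS idea, here is a stripped-down sketch of such a recaching process. The module and function names are made up for illustration; this is not the real KVS interface:

-module(kvs_sketch).
-export([start/3, read/1]).

%% Cache the result of Fun under the registered name Key (an atom)
%% and recompute it every Interval milliseconds.
start(Key, Fun, Interval) ->
    register(Key, spawn(fun() -> loop(Fun, Fun(), Interval) end)).

%% Fetch the currently cached value.
read(Key) ->
    Key ! {read, self()},
    receive {value, V} -> V end.

loop(Fun, Cached, Interval) ->
    receive
        {read, From} ->
            From ! {value, Cached},
            loop(Fun, Cached, Interval)
    after Interval ->
        loop(Fun, Fun(), Interval)    %% recache every n milliseconds
    end.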

Mnesia is a very cool design and I like it. I especially love that, for example, RAM nodes can access the DB just by declaring where the master DB is, you can fragment a DB over several nodes, it has transaction capabilities, and it can reside on several nodes. For a lazy look-up application that is awesome. But I found that in real life the nodes simply cannot keep up with the transaction volume of an MMO framework. If I create one master node, then a transaction issued from a slave node can take several seconds to return a result, and if the master goes down, so does the whole system. If I define several masters, synchronization takes far too long, and if one node goes down the others start complaining and blow up.

So here comes the salvation, I hope: the birth of KVS2. The code is distributed to the nodes through the LB and auto-started. The first node in the system reads its data from text files; other slave nodes copy the tables over, and each table runs as a named process. When a node changes or deletes data, it broadcasts the same transaction to the other nodes, so they stay in sync and each node enjoys a local cache. Speed, speed, speed.

If the master goes down, big deal: the other KVS2 servers are linked to the master and detect its death, so they choose a new one. So far they pick the new master simply by name; I could just as easily ask the LB and cache the answer with KVS so that they all agree, but for now I don't. The chosen node is declared the new master, end of story. The master saves the data to disk every 5 minutes, and you can even force a node to be the master, for example when the original comes back up.

A word of warning here: if several nodes write to the same record, you can get confusing results. There is no check that all nodes hold the same data, so if messages for the same record arrive in different orders, each node can end up with a different record. Be careful; this scheme assumes that only one process manages a given record, as in SMASH, where only supervisors write records. When accuracy is critical, though, it is very easy to send the write to the master node instead; the rpc interface handles that, and the master then sends it around. That is an easy way to ensure transactional ordering and uniqueness without conflicting records. Simple, short and powerful.

Obviously this still needs to be tweaked and debugged, but it is already working. Later on I will integrate a time stamp so servers can sync to whoever holds more recent data, but for the time being I believe this new scheme solves my database requirements nicely: I have a local cache, self-replicating data, and a system that keeps working until the last node is shut down, removing all the mnesia drawbacks.
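For the curious, the write broadcast boils down to something like this sketch. Again the names are illustrative, not the real KVS2 interface, and it shows only the replication path, not the election or the disk saves:

-module(kvs2_sketch).
-export([start/1, write/3, read/2]).

%% Each node runs the table as a named process holding a dict.
start(Table) ->
    register(Table, spawn(fun() -> loop(dict:new()) end)).

%% Apply the write locally, then broadcast the same transaction
%% to the table process on every other connected node.
write(Table, Key, Value) ->
    Table ! {write, Key, Value},
    [{Table, Node} ! {write, Key, Value} || Node <- nodes()],
    ok.

%% Reads are always served from the local copy: speed, speed, speed.
%% Returns {ok, Value} or error, as dict:find/2 does.
read(Table, Key) ->
    Table ! {read, Key, self()},
    receive {value, V} -> V end.

loop(Dict) ->
    receive
        {write, Key, Value} ->
            loop(dict:store(Key, Value, Dict));
        {read, Key, From} ->
            From ! {value, dict:find(Key, Dict)},
            loop(Dict)
    end.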

I will also, somehow, integrate preferred servers as master nodes, which should make it easier to know where the files get saved. The keen observer may have noticed that we lose any kind of indexes and filters, for the time being. Filters are actually easy to implement given the [X || {Y,X} <- List] functionality; normally we know the key "Y" and get "X" as the value, but we can also filter like this:

Value = [Pid || {Pid, Node} <- Answers, Node =:= Filter]

This grabs the Pids from the tuples when all we know is the node name, passed to the function as "Filter". Here the value, not the key, is the search term.
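For example, with made-up atoms standing in for the pids and node names:

Answers = [{pid_a, node1@host}, {pid_b, node2@host}, {pid_c, node1@host}],
Filter = node1@host,
[Pid || {Pid, Node} <- Answers, Node =:= Filter].
%% -> [pid_a, pid_c]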

And there will be no indexes; it is cheaper to define a small routine that creates a separate table instead (a sketch follows below). So, a couple more days until KVS2 is debugged, and then off to replace mnesia completely with calls to KVS2. I have to replace something like 100 calls, quite doable actually.
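That index-replacing routine could look roughly like this, reusing the illustrative kvs2_sketch calls from the sketch above: feed it the {Key, Value} pairs of the primary table and it writes them inverted into a second table, so look-ups by value become plain key look-ups.

%% Build a "poor man's index": store each pair inverted in a second
%% table, so a look-up by Value is just a normal key look-up.
build_index(Pairs, IndexTable) ->
    kvs2_sketch:start(IndexTable),
    [kvs2_sketch:write(IndexTable, Value, Key) || {Key, Value} <- Pairs],
    ok.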

And then I will implement the authorization scheme, urgh !!!

Cheers,

Sunweaver

P.S. Debugging is almost done, and all the edge cases seem to be accounted for. Given their intrinsic structure, KVS and KVS2 work like a short-term and a long-term memory; hmm, I might call the pair Skynet Database in the future, sounds better than Key Value Server, doesn't it??
