A Database Approach to Distributed State Space Generation (accepted for publication at PDMC 2007).
We study distributed state space generation on a cluster of workstations. We explain why partitioning the state space by a global hash function is problematic when states contain variables from unbounded domains, such as lists or other recursive datatypes. Our solution is to introduce a database which maintains a global numbering of state values. We also describe tree-compression, a technique of recursive state folding, and show that it is superior to manipulating plain state vectors.
This solution is implemented and linked to the μCRL toolset, where state values are implemented as maximally shared terms (ATerms). However, it is applicable to other models as well, e.g., PROMELA models via the NIPS virtual machine. Our experiments show the trade-offs between keeping the database global, replicated, or local, depending on the available network bandwidth and latency.
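To make the two ideas from the abstract concrete, here is a minimal single-process sketch in Python. It is not the μCRL implementation: the class names `IndexTable` and `TreeCompressor`, the binary split of the state vector, and the in-memory dictionaries standing in for the database are all assumptions made for illustration. Each variable slot gets its own table that numbers the values seen so far, and pairs of numbers are then folded recursively until the whole state is a single integer.

```python
class IndexTable:
    """Numbers distinct values consecutively (stand-in for one database table)."""

    def __init__(self):
        self._index = {}   # value -> number
        self._values = []  # number -> value

    def put(self, value):
        """Return the number of `value`, assigning a fresh one if it is new."""
        idx = self._index.get(value)
        if idx is None:
            idx = len(self._values)
            self._index[value] = idx
            self._values.append(value)
        return idx

    def get(self, idx):
        return self._values[idx]

    def __len__(self):
        return len(self._values)


class TreeCompressor:
    """Folds a fixed-width state vector into one integer via a binary tree of tables."""

    def __init__(self, width):
        if width < 1:
            raise ValueError("width must be at least 1")
        self.width = width
        # One table per variable slot: arbitrary (hashable) values become small integers.
        self.leaves = [IndexTable() for _ in range(width)]
        # One table per inner tree node, keyed by the slot range it covers.
        self.nodes = {}

    def _fold(self, ids, lo, hi):
        if hi - lo == 1:
            return ids[lo]
        mid = (lo + hi) // 2
        pair = (self._fold(ids, lo, mid), self._fold(ids, mid, hi))
        return self.nodes.setdefault((lo, hi), IndexTable()).put(pair)

    def compress(self, state):
        """Map a state (tuple of hashable values) to a single integer."""
        if len(state) != self.width:
            raise ValueError("state has the wrong width")
        ids = [self.leaves[i].put(v) for i, v in enumerate(state)]
        return self._fold(ids, 0, self.width)

    def _unfold(self, idx, lo, hi, out):
        if hi - lo == 1:
            out.append(self.leaves[lo].get(idx))
            return
        mid = (lo + hi) // 2
        left, right = self.nodes[(lo, hi)].get(idx)
        self._unfold(left, lo, mid, out)
        self._unfold(right, mid, hi, out)

    def decompress(self, idx):
        """Recover the original state tuple from its integer."""
        out = []
        self._unfold(idx, 0, self.width, out)
        return tuple(out)


compressor = TreeCompressor(4)
s1 = (0, ("a",), "idle", ())
s2 = (0, ("a", "b"), "idle", ())  # differs from s1 only in the list-valued slot
n1 = compressor.compress(s1)
n2 = compressor.compress(s2)
```

In this sketch the two example states share the table entry for their identical right half, which is the kind of reuse that folding is meant to expose. Once every value has a number, states become fixed-size integer tuples no matter how large the underlying lists grow. Where the tables live (one global database, replicas, or local copies) is exactly the trade-off the experiments examine, and is left out here.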
The work described here has been in successful production use for some time. Still, while taking the measurements for the paper, we got a truckload of ideas for improvements. Watch this space...
Fun war story: when measuring, we found that 2% of the zillion queries between the cluster nodes and the database were disproportionately slow: they took a whopping 200 milliseconds round-trip! With some analyzer gear plugged into the switch and the help of our networking guys, we were able to predict when one of the slow messages would appear (confirmed by program instrumentation). In the end we traced it to a driver problem. Switching to a different brand of network cards made the problem go away.