Michael Weber: Random Bits and Pieces

...Please don't assume Lisp is only useful for Animation and Graphics, AI, Bioinformatics, B2B and E-Commerce, Data Mining, EDA/Semiconductor applications, Expert Systems, Finance, Intelligent Agents, Knowledge Management, Mechanical CAD, Modeling and Simulation, Natural Language, Optimization, Research, Risk Analysis, Scheduling, Telecom, and Web Authoring just because these are the only things they happened to list.

Kent M. Pitman

[Picture of UM Platter]

(with apologies to Drew McDermott)

Although the right people were in the same room at the right time, we did not find the time to participate in this year's ICFP Programming Contest. The warm-up task was to implement the Universal Machine (UM), which was then used to run a program containing the real riddles.

However, a few days later, after reading that many participants had abandoned their first VMs (in their languages of choice) in favor of C reimplementations for speed reasons, I spent a train ride (and the subsequent Sunday afternoon) implementing and benchmarking the UM specification in Common Lisp.

With only fourteen instructions, the interpreter loop is a pretty straightforward transcription of the specification:


(loop
  ;; Fetch and decode the next instruction word.
  (setf insn (aref program (offset pc))
        opcode (ldb (byte 4 28) insn)
        a (ldb (byte 3 6) insn)
        b (ldb (byte 3 3) insn)
        c (ldb (byte 3 0) insn))
  (incf pc)
  (ecase opcode
    (#x0 (unless (zerop (reg c)) (setf (reg a) (reg b))))               ; Conditional Move
    (#x1 (setf (reg a) (aref (platter-array (reg b)) (offset (reg c))))) ; Array Index
    (#x2 (setf (aref (platter-array (reg a)) (offset (reg b))) (reg c))) ; Array Amendment
    (#x3 (setf (reg a) (logand #xFFFFFFFF (+ (reg b) (reg c)))))        ; Addition
    (#x4 (setf (reg a) (logand #xFFFFFFFF (* (reg b) (reg c)))))        ; Multiplication
    (#x5 (setf (reg a) (logand #xFFFFFFFF (truncate (reg b) (reg c))))) ; Division
    (#x6 (setf (reg a) (logand #xFFFFFFFF (lognand (reg b) (reg c)))))  ; Not-And
    (#x7 (return))                                                      ; Halt
    (#x8 (setf (reg b) (platter-array-allocate (reg c))))               ; Allocation
    (#x9 (platter-array-free (reg c)))                                  ; Abandonment
    (#xA (write-char (code-char (logand #xff (reg c))))                 ; Output
         (when (= (reg c) #.(char-code #\Newline))
           (finish-output)))
    (#xB (setf (reg c) (handler-case (char-code (read-char))            ; Input
                         (end-of-file () #xFFFFFFFF))))
    (#xC (when (plusp (reg b))                                          ; Load Program
           (setf program (copy-seq (platter-array (reg b)))))
         (setf pc (reg c)))
    (#xD (setf (reg (ldb (byte 3 25) insn))                             ; Orthography
               (ldb (byte 25 0) insn)))))
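
The loop relies on a few helpers that are not shown above. For completeness, here is a minimal sketch of what they might look like (the register-file representation, and the assumption that OFFSET merely narrows a platter to a valid array index, are mine, not the original code):

(defvar *registers*
  (make-array 8 :element-type '(unsigned-byte 32) :initial-element 0))

(defmacro reg (n)
  ;; Registers are SETF-able places into the register file.
  `(aref *registers* ,n))

(defmacro offset (form)
  ;; Treat a platter as an array index.
  `(the fixnum ,form))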

Noteworthy details:

  • With (logand #xFFFFFFFF ...) we make explicit that we implement 32-bit modular arithmetic as per the UM specification. Helpfully, SBCL knows about modular arithmetic, too, and generates efficient machine code for it.
  • (ldb (byte size position) integer) is a nice way to express what would require explicit shift-and-mask bit-fiddling in other languages. Good examples are the decoding of opcode and registers from instructions (see the top of the loop), or how byte order gets swapped in RUN; see the sketch after this list.
  • The really interesting opcodes are the ones dealing with platter arrays (#x1, #x2, #x8, #x9, #xC). A few trial runs quickly make it obvious that the given UM programs allocate and free arrays extremely often, many of them small.
  • The provided SANDmark benchmark and the actual application (the uncompressed codex) place different demands on the VM. Just making SANDmark fast does not necessarily help with, e.g., the UM's adventure or the QvickBasic Compiler.
  • We are benchmarking GC performance here, as well as code generator quality for tight loops with integer arithmetic.
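
To make the first two points concrete, here is a small sketch (ADD-MOD-32 and SWAP-BYTE-ORDER-32 are illustrative helpers, not part of the original source) of 32-bit modular addition and LDB-based byte swapping:

(defun add-mod-32 (x y)
  ;; 32-bit modular addition; given the type declarations, SBCL compiles
  ;; the LOGAND/+ combination down to a plain machine add.
  (declare (type (unsigned-byte 32) x y))
  (logand #xFFFFFFFF (+ x y)))

(defun swap-byte-order-32 (word)
  ;; Reassemble the four 8-bit bytes of a 32-bit word in reverse order;
  ;; LDB keeps this kind of bit-fiddling readable.
  (declare (type (unsigned-byte 32) word))
  (logior (ash (ldb (byte 8  0) word) 24)
          (ash (ldb (byte 8  8) word) 16)
          (ash (ldb (byte 8 16) word)  8)
          (ldb (byte 8 24) word)))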

Trust in Garbage Collection

Frederic Jolliton's UM implementation (also in CL) uses malloc-like memory management for arrays. While I did not go that far, I tried several different schemes (reusing freed arrays instead of allocating new ones, handling small arrays separately, etc.). All of them had in common that they did not win a lot over a simple strategy that offloads freeing to CL's garbage collector; often they performed even worse. Unsurprisingly, the implementation also got a lot shorter that way.

I keep a map from UM array indices to CL arrays, and a free list to keep track of reusable indices. The latter is crucial for keeping the size of the index map down: arrays are allocated tens of millions of times, but, e.g., fewer than 50000 array indices are live at any time in the SANDmark benchmark.
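
A minimal sketch of this scheme (the global variables and exact representation are assumptions; PLATTER-ARRAY, PLATTER-ARRAY-ALLOCATE, and PLATTER-ARRAY-FREE are the names used in the interpreter loop above):

(defvar *platter-arrays*
  (make-array 0 :adjustable t :fill-pointer 0)
  "Map from UM array indices to CL arrays.")

(defvar *free-indices* '()
  "Indices of abandoned arrays, available for reuse.")

(defun platter-array (index)
  (aref *platter-arrays* index))

(defun platter-array-allocate (size)
  ;; Prefer a freed index over growing the index map.
  (let ((array (make-array size :element-type '(unsigned-byte 32)
                                :initial-element 0)))
    (cond (*free-indices*
           (let ((index (pop *free-indices*)))
             (setf (aref *platter-arrays* index) array)
             index))
          (t (vector-push-extend array *platter-arrays*)))))

(defun platter-array-free (index)
  ;; Drop the reference; the actual freeing is left to the GC.
  (setf (aref *platter-arrays* index) nil)
  (push index *free-indices*))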

For a short time, I toyed with a copy-on-write strategy for arrays for opcode #xC (Load Program), but there was no real gain, so the idea was abandoned. Neither did I optimize I/O performance this time, as it seemed less critical.

I now DECLARE this bazaar open

I was not patient enough to let the UM finish SANDmark without any added declarations. However, just telling SBCL that the parameter of EXECUTE, the program, is actually of type (simple-array platter (*)) improves things a lot. From way too slow we get down to an acceptable 229.0 seconds.

Next, we declare the index map of arrays to be of type (simple-array t (*)), which brings us down to 174.3 seconds (a gain of 54.7 seconds).

We can put the compiler into a more aggressive mode with the declaration (optimize (speed 3)). Down to 170.9 seconds. Declaring some variables of (user) type array-index also gets rid of some compiler notes, and we are at 165.5 seconds. By declaring (optimize (safety 0)), we allow the compiler to trust us and omit safety checks at run time (also silencing most of the remaining compiler notes, which were about fixnum checks), finally arriving at 145.2 seconds. A far cry from where we started, and all this with four more lines of declarations.
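
Put together, the declarations amount to roughly the following sketch (the PLATTER and ARRAY-INDEX definitions are how I would write them; the original's exact types and placement may differ):

(deftype platter () '(unsigned-byte 32))

(deftype array-index () `(integer 0 ,(1- array-dimension-limit)))

(declaim (optimize (speed 3) (safety 0)))

(defun execute (program)
  (declare (type (simple-array platter (*)) program))
  (let ((pc 0))
    (declare (type array-index pc))
    ;; the interpreter loop from above goes here
    pc))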

Finishing Touch

We now get quite nice times:

SANDmark complete.
Evaluation took:
  145.229 seconds of real time
  142.4289 seconds of user run time
  0.568036 seconds of system run time
  0 page faults and
  2,651,641,176 bytes consed.

However, we can still do better. If we examine the generated machine code (SBCL 0.9.13) with DISASSEMBLE, we find that the ECASE form in the instruction-decoding loop is translated poorly:

;      249:       E905040000       JMP L38
;      24E: L9:   83F804           CMP EAX, 4
;      251:       0F846E030000     JEQ L34
;      257:       83F808           CMP EAX, 8
;      25A:       0F8430030000     JEQ L32
;      260:       83F80C           CMP EAX, 12
;      263:       0F8405030000     JEQ L31
;      269:       83F810           CMP EAX, 16
;      26C:       0F84D8020000     JEQ L30
;      272:       83F814           CMP EAX, 20
;      275:       0F84A3020000     JEQ L29
[...]

GCC uses all kinds of tricks for large switch statements, like tree partitioning and jump tables. With a little macrology, we can partition the case statement similarly:

(defmacro ecase/tree (keyform &body cases)
  "Like ECASE, but dispatching via a binary tree of key comparisons
instead of one linear chain of tests."
  (labels ((%case/tree (keyform cases)
             (if (<= (length cases) 4)
                 `(ecase ,keyform ,@cases)
                 ;; Split the (sorted) clauses at the median key and recur.
                 (loop for rest-cases on cases
                       repeat (truncate (length cases) 2)
                       collect (first rest-cases) into first-half
                       finally (return `(if (< ,keyform ,(caar rest-cases))
                                            ,(%case/tree keyform first-half)
                                            ,(%case/tree keyform rest-cases)))))))
    (let (($keyform (gensym "CASE/TREE-")))
      `(let ((,$keyform ,keyform))
         ,(%case/tree $keyform (sort (copy-list cases) #'< :key #'first))))))
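
For illustration, eight clauses split once at the median key before bottoming out in plain ECASE (a sketch of the expansion, with the gensym renamed for readability):

(ecase/tree op
  (0 'a) (1 'b) (2 'c) (3 'd) (4 'e) (5 'f) (6 'g) (7 'h))

;; expands into:
;; (let ((key op))
;;   (if (< key 4)
;;       (ecase key (0 'a) (1 'b) (2 'c) (3 'd))
;;       (ecase key (4 'e) (5 'f) (6 'g) (7 'h))))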

Using the ECASE/TREE macro improves the quality of the generated code somewhat. The performance gain on SANDmark is about a factor of 1.2 over the original ECASE version:

SANDmark complete.
Evaluation took:
  120.372 seconds of real time
  117.30733 seconds of user run time
  0.680042 seconds of system run time
  0 page faults and
  2,651,634,776 bytes consed.

Running the same program with CMUCL, we get:

SANDmark complete.

; Evaluation took:
;   131.0 seconds of real time
;   126.16389 seconds of user run time
;   3.84424 seconds of system run time
;   208,528,252,530 CPU cycles
;   [Run times include 17.8 seconds GC run time]
;   0 page faults and
;   2,654,874,728 bytes consed.

On a final note, g++ still generates tighter code for the corresponding switch statement, but we are not that far off.

Comparison

Compared to the (then) fastest C(++) UM implementation I was aware of, this implementation has about one third less code, even with ECASE/TREE taken into account. With my benchmark setup, Li Peng's implementation runs in 104 seconds:

% time ./peng-um sandmark.umz
[...]
SANDmark complete.
./peng-um sandmark.umz  102.58s user 0.20s system 98% cpu 1:44.75 total

Its execution is faster than my CL implementation by about a factor of 1.15. Not bad, and it seems once again to support Gabriel's hypothesis that one can get CL programs to within twenty percent of the performance of their C counterparts.

Also, Ranjit Mathew reported on using array pointers as indices:

"One of the wicked ideas from the mailing list did help however - using the pointer to an array's memory as its index instead of maintaining an "array of arrays", yielded a 20% boost in the performance of my UM implementation (as measured by SANDmark) but reduced its "portability" to only those machines where both integers and pointers are 32 bits."

The array index trick is not easily transferable to CL, at least not portably across different implementations. Then again, it also precludes running the UM on anything but 32-bit architectures, whereas the CL implementation presented here works fine on 64-bit hardware.

Quo Vadis?

The next step, if UM speed is still not considered sufficient, would probably be to translate UM byte-code directly to Lisp, possibly as a JIT compiler. And indeed, people seem to be going the JIT route. I will save this for another time, though.
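
As a taste of what that could look like, here is a purely illustrative sketch (only two opcodes handled) that maps one UM instruction word to an equivalent Lisp form; a JIT would translate whole program arrays this way and hand the result to COMPILE:

(defun translate-insn (insn)
  ;; The remaining opcodes would follow the interpreter loop above.
  (let ((a (ldb (byte 3 6) insn))
        (b (ldb (byte 3 3) insn))
        (c (ldb (byte 3 0) insn)))
    (ecase (ldb (byte 4 28) insn)
      (#x3 `(setf (reg ,a) (logand #xFFFFFFFF (+ (reg ,b) (reg ,c)))))          ; Addition
      (#x6 `(setf (reg ,a) (logand #xFFFFFFFF (lognand (reg ,b) (reg ,c)))))))) ; Not-And

Opcode #xC (Load Program) is what makes this approach tricky: freshly loaded program arrays would have to be (re)compiled at run time.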


Unless noted otherwise, all times are wall-clock times and were measured with (time (run #p"sandmark.umz")) at the REPL within SLIME, with SBCL 0.9.13 on a 1.6 GHz Turion64 laptop with 450MB RAM, running Linux 2.6.15. Additional tests were conducted with SBCL 0.9.15, CMUCL 19a, and LispWorks 4.4 Personal Edition.
If anybody manages to get this code fast on OpenMCL, LispWorks, Allegro CL, etc., drop me a line.

UPDATE 2008-10-30: Going Nowhere... Fast

Opcode #xD (Orthography) is somewhat special in that it does not use the common register decoding. Initially I thought it was not worth special-casing, but after wondering why another UM implementation was unexpectedly fast, I did some measurements and was proven wrong. It turns out we can bring times down by another 15% by changing the execution loop like this:


(cond ((= opcode #xD)                   ; Orthography: special immediate format
       (setf (reg (ldb (byte 3 25) insn))
             (ldb (byte 25 0) insn)))
      (t
       ;; Decode A, B, C only for the opcodes that actually use them.
       (setf a (ldb (byte 3 6) insn)
             b (ldb (byte 3 3) insn)
             c (ldb (byte 3 0) insn))
       (ecase/tree opcode
         ...)))

SANDmark complete.
Evaluation took:
  103.894 seconds of real time
  100.318275 seconds of user run time
  0.712045 seconds of system run time
  0 page faults and
  2,651,630,240 bytes consed.