(with apologies to Drew McDermott)
Although the right people were in the same room at the right time, we did not have time to participate in this year's ICFP Programming Contest. The warm-up task was to implement the Universal Machine (UM), which was then used to run a program containing the real riddles.
However, a few days later, after reading that for speed reasons many participants had abandoned their first VMs (in their languages of choice) in favor of C reimplementations, I spent a train ride (and the subsequent Sunday afternoon) implementing and benchmarking the UM specification in Common Lisp.
With only fourteen instructions, the interpreter loop is a pretty straightforward transcription of the specification:
(loop
  (setf insn (aref program (offset pc))
        opcode (ldb (byte 4 28) insn)
        a (ldb (byte 3 6) insn)
        b (ldb (byte 3 3) insn)
        c (ldb (byte 3 0) insn))
  (incf pc)
  (ecase opcode
    (#x0 (unless (zerop (reg c))
           (setf (reg a) (reg b))))
    (#x1 (setf (reg a) (aref (platter-array (reg b)) (offset (reg c)))))
    (#x2 (setf (aref (platter-array (reg a)) (offset (reg b))) (reg c)))
    (#x3 (setf (reg a) (logand #xFFFFFFFF (+ (reg b) (reg c)))))
    (#x4 (setf (reg a) (logand #xFFFFFFFF (* (reg b) (reg c)))))
    (#x5 (setf (reg a) (logand #xFFFFFFFF (truncate (reg b) (reg c)))))
    (#x6 (setf (reg a) (logand #xFFFFFFFF (lognand (reg b) (reg c)))))
    (#x7 (return))
    (#x8 (setf (reg b) (platter-array-allocate (reg c))))
    (#x9 (platter-array-free (reg c)))
    (#xA (write-char (code-char (logand #xff (reg c))))
         (when (= (reg c) #.(char-code #\Newline))
           (finish-output)))
    (#xB (setf (reg c) (handler-case (char-code (read-char))
                         (end-of-file () #xFFFFFFFF))))
    (#xC (when (plusp (reg b))
           (setf program (copy-seq (platter-array (reg b)))))
         (setf pc (reg c)))
    (#xD (setf (reg (ldb (byte 3 25) insn))
               (ldb (byte 25 0) insn)))))
With (logand #xFFFFFFFF ...) we make explicit that we implement 32-bit modular arithmetic, as per the UM specification. Helpfully, SBCL knows about modular arithmetic too, and generates efficient machine code for it.
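As a small standalone illustration (not part of the interpreter itself), masking with #xFFFFFFFF makes additions and multiplications wrap around exactly like 32-bit unsigned arithmetic:

```lisp
;; 32-bit wrap-around via masking: an illustration only, not part of
;; the interpreter itself.
(defun u32+ (x y)
  (logand #xFFFFFFFF (+ x y)))

(defun u32* (x y)
  (logand #xFFFFFFFF (* x y)))

;; (u32+ #xFFFFFFFF 1) => 0
;; (u32* #x80000000 2) => 0
```

When the operands are declared as (unsigned-byte 32), SBCL recognizes this pattern and avoids bignum arithmetic entirely.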
(ldb (byte size position) integer) is a nice way to write what would require bit-fiddling operations in other languages. Good examples are the decoding of opcode and registers from instructions (see the top of the loop), or how byte order can be swapped when reading the program file.
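For illustration (the instruction word below is made up for this example), decoding the fields of a UM instruction with LDB looks like this:

```lisp
;; Decoding the fields of a (made-up) instruction word with LDB.
;; Opcode #x3 is Addition; A, B and C are register numbers.
(defparameter *insn* #x300001CA)

;; (ldb (byte 4 28) *insn*) => 3   ; opcode
;; (ldb (byte 3 6) *insn*)  => 7   ; register A
;; (ldb (byte 3 3) *insn*)  => 1   ; register B
;; (ldb (byte 3 0) *insn*)  => 2   ; register C
```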
- The really interesting opcodes are the ones dealing with platter arrays (#x1, #x2, #x8, #x9, #xC). A few trial runs quickly make it obvious that the given UM programs allocate and free arrays extremely often, many of them small.
- The provided SANDmark benchmark and the actual application (the uncompressed codex) place different demands on the VM. Just making SANDmark fast does not necessarily help with, e.g., the UM's adventure game or the QvickBasic Compiler.
- We are benchmarking GC performance here, as well as code generator quality for tight loops with integer arithmetic.
Trust in Garbage Collection
Frederic Jolliton's UM implementation (also in CL) uses malloc-like memory management for arrays. While I did not go that far, I tried several different schemes (reusing freed arrays instead of allocating new ones, handling small arrays separately, etc.). All of them had in common that they did not win a lot over a simple strategy that offloads freeing to CL's garbage collector; often they performed even worse. Unsurprisingly, the implementation also got a lot shorter that way.
I keep a map of UM array indices to CL arrays, and a free list to keep track of reusable indices. The latter is quite crucial for keeping the size of the index map down: arrays are allocated tens of millions of times, but, e.g., fewer than 50000 array indices are live at any time in the SANDmark benchmark.
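A minimal sketch of this scheme (the helper names match those used in the interpreter loop above, but the details of the real implementation may differ):

```lisp
;; Sketch of the index map plus free list scheme; freeing is offloaded
;; to the CL garbage collector.
(deftype platter () '(unsigned-byte 32))

(defvar *arrays* (make-array 16 :adjustable t :fill-pointer 0))
(defvar *free-indices* '())   ; UM array indices available for reuse

(defun platter-array-allocate (size)
  "Return the UM index of a fresh zero-filled platter array,
reusing a freed index when one is available."
  (let ((new (make-array size :element-type 'platter :initial-element 0)))
    (cond (*free-indices*
           (let ((index (pop *free-indices*)))
             (setf (aref *arrays* index) new)
             index))
          (t
           ;; VECTOR-PUSH-EXTEND returns the index it stored at.
           (vector-push-extend new *arrays*)))))

(defun platter-array-free (index)
  "Drop the array and remember its index for reuse; the garbage
collector reclaims the storage eventually."
  (setf (aref *arrays* index) nil)
  (push index *free-indices*))

(defun platter-array (index)
  (aref *arrays* index))
```

Keeping the free list means the index map stays small even though allocation happens tens of millions of times.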
For a short time, I toyed with a copy-on-write strategy for arrays used by #xC (Load Program), but there was no real gain, so the idea was abandoned. Optimizing I/O performance was not done this time either, as it seemed less promising.
I now DECLARE this bazaar open
I was not patient enough to let the UM finish SANDmark without any added declarations. However, just telling SBCL that the program and platter arrays are actually of type (simple-array platter (*)) improves things a lot. From way too slow we get down to an acceptable 229.0 seconds.
We can put the compiler into a more aggressive mode with (declare (optimize speed)). Down to 170.9 seconds. We declare some variables to be of the (user-defined) type array-index, which also gets rid of some compiler notes, and we are at 165.5 seconds. By declaring (optimize (safety 0)), we allow the compiler to trust us and thus omit safety checks at run time (also silencing most of the remaining compiler notes, as they were about fixnum checks), finally arriving at 145.2 seconds. A far cry from where we started, and all this with four more lines of declarations.
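To illustrate the style of declarations involved, here is a sketch with a made-up function (not the actual interpreter code):

```lisp
;; Sketch only: a made-up function showing the declaration style.
(deftype platter () '(unsigned-byte 32))

(defun checksum (program)
  ;; Tell SBCL the exact array type, ask for aggressive optimization,
  ;; and drop run-time safety checks.
  (declare (type (simple-array platter (*)) program)
           (optimize speed (safety 0)))
  (let ((acc 0))
    (declare (type platter acc))
    (dotimes (i (length program) acc)
      (setf acc (logand #xFFFFFFFF (+ acc (aref program i)))))))
```

With the array type known, AREF compiles to a direct indexed load, and the masked addition stays within machine-word arithmetic.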
Finishing Touch
We now get quite nice times:
SANDmark complete.
Evaluation took:
  145.229 seconds of real time
  142.4289 seconds of user run time
  0.568036 seconds of system run time
  0 page faults and
  2,651,641,176 bytes consed.
However, we can still do better. If we examine the generated machine code (SBCL 0.9.13) with DISASSEMBLE, we find that the ECASE statement in the instruction decoding loop is translated poorly:
; 249: E905040000       JMP L38
; 24E: L9: 83F804       CMP EAX, 4
; 251: 0F846E030000     JEQ L34
; 257: 83F808           CMP EAX, 8
; 25A: 0F8430030000     JEQ L32
; 260: 83F80C           CMP EAX, 12
; 263: 0F8405030000     JEQ L31
; 269: 83F810           CMP EAX, 16
; 26C: 0F84D8020000     JEQ L30
; 272: 83F814           CMP EAX, 20
; 275: 0F84A3020000     JEQ L29
[...]
GCC uses all kinds of tricks for large switch statements, like tree partitioning and jump tables. With a little macrology, we can partition the case statement similarly:
(defmacro ecase/tree (keyform &body cases)
  (labels ((%case/tree (keyform cases)
             (if (<= (length cases) 4)
                 `(ecase ,keyform ,@cases)
                 (loop for rest-cases on cases
                       repeat (truncate (length cases) 2)
                       collect (first rest-cases) into first-half
                       finally (return
                                 `(if (< ,keyform ,(caar rest-cases))
                                      ,(%case/tree keyform first-half)
                                      ,(%case/tree keyform rest-cases)))))))
    (let (($keyform (gensym "CASE/TREE-")))
      `(let ((,$keyform ,keyform))
         ,(%case/tree $keyform (sort (copy-list cases) #'< :key #'first))))))
The ECASE/TREE macro improves the quality of the generated code somewhat. The performance gain on SANDmark is about a factor of 1.2 relative to the original ECASE:
SANDmark complete.
Evaluation took:
  120.372 seconds of real time
  117.30733 seconds of user run time
  0.680042 seconds of system run time
  0 page faults and
  2,651,634,776 bytes consed.
Running the same program with CMUCL, we get:
SANDmark complete.
; Evaluation took:
;   131.0 seconds of real time
;   126.16389 seconds of user run time
;   3.84424 seconds of system run time
;   208,528,252,530 CPU cycles
;   [Run times include 17.8 seconds GC run time]
;   0 page faults and
;   2,654,874,728 bytes consed.
On a final note, g++ still generates tighter code for the corresponding switch statement, but we are not that far off.
Compared to the (then) fastest UM implementation I was aware of, this implementation has one third less code, even with the declarations taken into account. With my benchmark setup, Li Peng's implementation runs in:
% time ./peng-um sandmark.umz
[...]
SANDmark complete.
./peng-um sandmark.umz  102.58s user 0.20s system 98% cpu 1:44.75 total
Execution is faster than my CL implementation by about a factor of 1.15. Not bad, and it again seems to support Gabriel's hypothesis that one can get CL programs within twenty percent of the performance of their C counterparts.
"One of the wicked ideas from the mailing list did help however - using the pointer to an array's memory as its index instead of maintaining an "array of arrays", yielded a 20% boost in the performance of my UM implementation (as measured by SANDmark) but reduced its "portability" to only those machines where both integers and pointers are 32 bits."
The array-index trick is not easily transferable to CL, at least not portably across different implementations. Then again, it also precludes running the UM on 64-bit architectures, while the CL implementation presented here works fine on 64-bit hardware.
The next step, if UM speed is still not considered sufficient, would probably be to translate UM byte-code directly to Lisp, possibly as a JIT compiler. And indeed, people seem to be going the JIT route. I will save this for another time, though.
Unless noted otherwise, all times are wall clock times and were measured with (time (run #p"sandmark.umz")) from the REPL within SLIME, with SBCL 0.9.13 on a 1.6 GHz Turion64 laptop with 450MB RAM, running Linux 2.6.15. Additional tests were conducted with SBCL 0.9.15, CMUCL 19a, and LispWorks 4.4.
If anybody manages to get this code fast on OpenMCL, LispWorks, Allegro CL, etc., drop me a line.
#xD (Orthography) is somewhat special in that it does not use the common decoding of registers. Initially I thought it was not worth special-casing, but after wondering why another UM implementation was unexpectedly fast, I did some measurements and was proven wrong. Apparently, we can bring times down by another 15% by changing the execution loop like this:
(cond ((= opcode #xD)
       (setf (reg (ldb (byte 3 25) insn))
             (ldb (byte 25 0) insn)))
      (t
       (setf a (ldb (byte 3 6) insn)
             b (ldb (byte 3 3) insn)
             c (ldb (byte 3 0) insn))
       (ecase/tree opcode
         ...)))
SANDmark complete.
Evaluation took:
  103.894 seconds of real time
  100.318275 seconds of user run time
  0.712045 seconds of system run time
  0 page faults and
  2,651,630,240 bytes consed.