Michael Weber: Random Bits and Pieces

While updating my Objective-C Resources page, I visited the GNUstep website to check what's new, and was astonished to see the following headline:

GNUstep Shutting Down

GNUstep was initially created as a cross-platform, object-oriented framework for desktop application development. Based on the OpenStep specification originally created by NeXT (now Apple), GNUstep should have enabled developers to rapidly build sophisticated software by employing a large library of reusable software components. We, the GNUstep developers, however, believe that we have failed to achieve this goal.

After inquiring, it turned out this is nonsense and the website simply had been defaced, telling everybody that the project will cease development immediately, and that developers should use MacOSX and Cocoa instead:

Screenshot of defaced GNUstep website

All is well again. The site has been restored, and the GNUstep project certainly is not dead.

Picture of UM Platter

(with apologies to Drew McDermott)

Although the right people were at the right time in the same room, we did not have time to participate in this year's ICFP Programming Contest. The warm-up task was to implement the Universal Machine (UM), which was then used to run a program which contained the real riddles.

However, a few days later after reading that for speed reasons many participants had abandoned their first VMs (in their languages of choice) in favor of C reimplementations, I spent a train ride (and the subsequent Sunday afternoon) on implementing and benchmarking the UM specification in Common Lisp.

With only fourteen instructions, the interpreter loop is a pretty straight forward transcription from the specification:


(loop
  (setf insn (aref program (offset pc))
        opcode (ldb (byte 4 28) insn)
        a (ldb (byte 3 6) insn)
        b (ldb (byte 3 3) insn)
        c (ldb (byte 3 0) insn))
  (incf pc)
  (ecase opcode
    (#x0 (unless (zerop (reg c)) (setf (reg a) (reg b))))
    (#x1 (setf (reg a) (aref (platter-array (reg b)) (offset (reg c)))))
    (#x2 (setf (aref (platter-array (reg a)) (offset (reg b))) (reg c)))
    (#x3 (setf (reg a) (logand #xFFFFFFFF (+ (reg b) (reg c)))))
    (#x4 (setf (reg a) (logand #xFFFFFFFF (* (reg b) (reg c)))))
    (#x5 (setf (reg a) (logand #xFFFFFFFF (truncate (reg b) (reg c)))))
    (#x6 (setf (reg a) (logand #xFFFFFFFF (lognand (reg b) (reg c)))))
    (#x7 (return))
    (#x8 (setf (reg b) (platter-array-allocate (reg c))))
    (#x9 (platter-array-free (reg c)))           
    (#xA (write-char (code-char (logand #xff (reg c))))
         (when (= (reg c) #.(char-code #\Newline))
           (finish-output)))
    (#xB (setf (reg c) (handler-case (char-code (read-char))
                         (end-of-file () #xFFFFFFFF))))
    (#xC (when (plusp (reg b))
           (setf program (copy-seq (platter-array (reg b)))))
         (setf pc (reg c)))
    (#xD (setf (reg (ldb (byte 3 25) insn))
               (ldb (byte 25 0) insn)))))

Noteworthy details:

  • With (logand #xFFFFFFFF ...) we make explicit that we implement 32-bit modular arithmetic as per the UM specification. Helpfully, SBCL knows about modular arithmetic, too, and generates efficient machine code for it.
  • (ldb (byte size position) integer) is a nice way to write what you would need bit-fiddling operations for in other languages. Good examples are the decoding of opcode and registers from instructions (see top of the loop) or how byte-order can be swapped in RUN.
  • The really interesting opcodes are the ones dealing with arrays platters (#x1,#x2,#x8,#x9,#xC). When doing some trial runs it becomes obvious fast that in the given UM programs, arrays are allocated and freed extremely often, many of small size.
  • The provided SANDmark benchmark and the actual application (the uncompressed codex) have different requirements to the VM. Just getting SANDmark fast is not necessarily helping for, e.g., the UM's adventure or the QvickBasic Compiler.
  • We are benchmarking GC performance here, as well as code generator quality for tight loops with integer arithmetic.

Trust in Garbage Collection

Frederic Jolliton's UM implementation (also in CL) uses malloc-like memory management for arrays. While I did not go that far, I tried several different schemes (reusing freed arrays instead of allocating new ones, handling small arrays seperately, etc.). All of them had in common that they were not winning at lot over a simple strategy which offloads freeing to CL's Garbage Collector, often they performed even worse. Unsurprisingly, the implementation got a lot shorter too, that way.

I keep a map of UM array indices to CL arrays, and a free list to keep track of reusable indices. The latter is quite crucial to keep the size of the index map down, because arrays are allocated tens of millions of times, but, e.g., less than 50000 array indices are active at any time in the SANDmark benchmark.

For a short time, I toyed with a copy-on-write strategy for arrays for opcode #xC (Load Program), but there was no real gain so this idea was abandoned. Optimizing I/O performance was not done either this time, as it seemed less critical.

I now DECLARE this bazaar open

I was not patient enough to let the UM without any added declarations finish SANDmark. However, just telling SBCL that the parameter to execute, the program, is actually of type (simple-array platter (*)) improves things a lot. From way too slow we get down to acceptable 229.0 seconds.

Next, we declare the array index arrays of type (simple-array t (*)), which brings us down to 174.3 seconds (losing 54.7 seconds).

We can put the compiler into a more aggressive mode with declaration (optimize (speed 3)). Down to 170.9 seconds. We declare some variables of (user) type array-index which also gets rid of some compiler notes, and we are at 165.5 seconds. By declaring (optimize (safety 0)), we allow the compiler to trust us and thus omitting safety checks at run-time (also silencing most of the compiler notes, as they were about fixnum checks), thus finally arriving at 145.2 seconds. A far cry from where we started, and all this with four more lines of declarations.

Finishing Touch

We now get quite nice times:
SANDmark complete.
Evaluation took:
  145.229 seconds of real time
  142.4289 seconds of user run time
  0.568036 seconds of system run time
  0 page faults and
  2,651,641,176 bytes consed.
However, we can still do better. If we examine the generated machine code (SBCL 0.9.13) with DISASSEMBLE, we find that the ECASE statement in the instruction decoding loop is translated poorly:
;      249:       E905040000       JMP L38
;      24E: L9:   83F804           CMP EAX, 4
;      251:       0F846E030000     JEQ L34
;      257:       83F808           CMP EAX, 8
;      25A:       0F8430030000     JEQ L32
;      260:       83F80C           CMP EAX, 12
;      263:       0F8405030000     JEQ L31
;      269:       83F810           CMP EAX, 16
;      26C:       0F84D8020000     JEQ L30
;      272:       83F814           CMP EAX, 20
;      275:       0F84A3020000     JEQ L29
[...]
GCC uses all kinds of tricks for large switch statements, like tree partitioning and jump tables. With little macrology, we can partition the case statement similarly:

(defmacro ecase/tree (keyform &body cases)
  (labels ((%case/tree (keyform cases)
             (if (<= (length cases) 4)
                 `(ecase ,keyform ,@cases)
                 (loop for rest-cases on cases
                       repeat (truncate (length cases) 2)
                       collect (first rest-cases) into first-half
                       finally (return `(if (< ,keyform ,(caar rest-cases))
                                            ,(%case/tree keyform first-half)
                                            ,(%case/tree keyform rest-cases)))))))
    (let (($keyform (gensym "CASE/TREE-")))
      `(let ((,$keyform ,keyform))
         ,(%case/tree $keyform (sort (copy-list cases) #'< :key #'first))))))

Using the ECASE/TREE macro improves quality of the generated code somewhat. The performance gain on SANDmark is about 1.2 relative to the original ECASE version:

SANDmark complete.
Evaluation took:
  120.372 seconds of real time
  117.30733 seconds of user run time
  0.680042 seconds of system run time
  0 page faults and
  2,651,634,776 bytes consed.

Running the same program with CMUCL, we get:

SANDmark complete.

; Evaluation took:
;   131.0 seconds of real time
;   126.16389 seconds of user run time
;   3.84424 seconds of system run time
;   208,528,252,530 CPU cycles
;   [Run times include 17.8 seconds GC run time]
;   0 page faults and
;   2,654,874,728 bytes consed.

On a final note, g++ still generates tighter code for the corresponding switch statement, but we are not that far off.

Comparison

Compared to the (then) fastest C(++) UM implementation I was aware of, this implementation has one third less code, and that with taking ECASE/TREE into account. With my benchmark setup, Li Peng's implementation runs in 104 seconds:

% time ./peng-um sandmark.umz
[...]
SANDmark complete.
./peng-um sandmark.umz  102.58s user 0.20s system 98% cpu 1:44.75 total

Execution is about factor 1.15 faster than my implementation in CL. Not bad, and it seems again to support Gabriel's hypothesis that one can get CL programs within twenty percent of the performance of their C counterpart.

Also, Ranjit Mathew reported on using array pointers as indices:

"One of the wicked ideas from the mailing list did help however - using the pointer to an array's memory as its index instead of maintaining an "array of arrays", yielded a 20% boost in the performance of my UM implementation (as measured by SANDmark) but reduced its "portability" to only those machines where both integers and pointers are 32 bits."

The array index trick is not easily transferable to CL, at least not portably across different implementations. Then again, it also precludes running the UM on 32-bit architectures while the CL implementation presented here works fine on 64-bit hardware.

Quo Vadis?

The next step, if UM speed is still not considered sufficient, would probably be to translate UM byte-code directly to Lisp, possibly as a JIT compiler. And indeed, people seem to be going the JIT route. I will save this for another time, though.


Unless noted otherwise, all times are wall clock times and were measured with (time (run #p"sandmark.umz")) from the REPL within SLIME, with SBCL 0.9.13 on a 1.6 GHz Turion64 laptop with 450MB RAM, running Linux 2.6.15. Additional tests were conducted with SBCL 0.9.15, CMUCL 19a, and LispWorks 4.4 Personal Edition.
If anybody manages to get this code fast on OpenMCL, Lispworks, Allegro CL, etc., drop me a line.

UPDATE 2006-08-16: Going Nowhere... Fast

Opcode #xD (Orthography) is somewhat special in that it does not use the common decoding of registers. Initially I thought it is not worth special-casing it but after wondering why another UM implementation was unexpectedly fast, I did some measurements and was proven wrong. Apparently, we can get down times by another 15% when changing the execution loop like this:


(cond ((= opcode #xD) (setf (reg (ldb (byte 3 25) insn))
                            (ldb (byte 25 0) insn)))
      (t (setf a (ldb (byte 3 6) insn)
               b (ldb (byte 3 3) insn)
               c (ldb (byte 3 0) insn))
         (ecase/tree opcode
           ...)))
SANDmark complete.
Evaluation took:
  103.894 seconds of real time
  100.318275 seconds of user run time
  0.712045 seconds of system run time
  0 page faults and
  2,651,630,240 bytes consed.
Adobe PDF Logo

(Cue Mission Impossible theme) Your task is to combine a number of Adobe PDF files into one and to number all pages consecutively. As additional obstacle, not all pages have the same page size and some pages already contain old page numbers, in various places.
Your time starts running... now!

I glued together multiple PDFs with PDFTK, but did not find a suitable tool to remove page numbers and draw new ones on the resulting file. Until I looked at Marc Battyani's neat cl-pdf library, which seemed to have all the ingredients needed. The initial version was developed completely in the REPL, basically in no time, which means I could report back to my boss in time for him to leave and catch the train. Instant local wizard status awarded, too.

Some small amount of glue code later (pdf-page-numbers.lisp), removing the old centered page numbers at coordinates (308,148) and adding new ones in the same place is now as easy as:


(pdf:with-existing-document (#p"old.pdf")
  (with-existing-pages ()
    (pdf:insert-original-page-content)
    (clear-page-numbers 308 148 :font "Times-Bold" :font-size 10.0)
    (page-numbers 308 148 :font "Times-Bold" :font-size 10.0))
  (pdf:write-document #p"new.pdf"))

Of course, this looks like it could be packaged into a nice function as well. However, in real life we have to deal with all kinds of nastiness (see above) which would render such a function not very useful. I used the following snippet eventually:


(pdf:with-existing-document (#p"/tmp/old.pdf")
  (with-existing-pages (:range (iota 94 :start 6)
                        :numbering (enumerate :from 1))
    (pdf:insert-original-page-content)
    (let ((x (if (evenp *page-number*) 135 480))
          (y (if (<= 50 *page-number* 64) 104 149)))
      (clear-page-numbers 308 y :font "Times-Roman" :font-size 10.0)
      (page-numbers x y :font "Times-Bold" :font-size 10.0 :justify t)))
  (pdf:write-document #p"/tmp/new.pdf"))

Tony Blair warned of the arc of extremism and called for an alliance of moderation instead of export of instability (all found in a single article, no less).

Let's see... Axis of evil... where are my Political Bullshit Bingo cards?!