Clozure CL Internals

Assembler

The built-in assembler is used both for LAP functions and for defining vinsn templates.

ARM (32-bit)

In general, an ARM memory operand is of the form “register + <offset>”, where <offset> can be an immediate constant, a register, or a shifted register.

(:$ n) -> immediate

Offset addressing:

(:@ r <offset>) -> ea = register + optional offset
(:+@ r <offset>) -> same as above
(:-@ r <offset>) -> ea = register - optional offset

Pre-indexed addressing (with writeback):

(:@! r <offset>) -> ea = register + optional offset; write ea back to r
(:+@! r <offset>) -> same as above
(:-@! r <offset>) -> ea = register - optional offset; write ea back to r

Post-indexed addressing (with writeback):

(:@+ r <offset>) -> ea = r; write ea + offset back to r
(:@- r <offset>) -> ea = r; write ea - offset back to r

<offset> can be an immediate, a register, or a shifted register.

(:lsl r1 imm/r2) -> offset = r1 shifted left by imm/r2
(:lsr r1 imm/r2) -> offset = r1 logically shifted right by imm/r2
(:asr r1 imm/r2) -> offset = r1 arithmetically shifted right by imm/r2
(:ror r1 imm/r2) -> offset = r1 rotated right by imm/r2
(:rrx r) -> offset = r rotated right with extend; the carry flag is rotated into bit 31
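For example, these operand forms might appear in ARM LAP code or vinsn templates roughly as follows. (This is an illustrative sketch only; the register names are stand-ins for whatever symbolic registers happen to be in scope.)

  (ldr imm0 (:@ arg_z (:$ 4)))               ; imm0 := [arg_z + 4]
  (str imm0 (:-@! vsp (:$ 8)))               ; ea = vsp - 8; [ea] := imm0; vsp := ea
  (ldr temp0 (:@ arg_y (:lsl imm1 (:$ 2))))  ; temp0 := [arg_y + (imm1 << 2)]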

x86

The LAP notation for x86 is based on the AT&T-style syntax used in traditional Unix assemblers like GNU as. Notably, the destination operand generally comes last, rather than first as in the Intel syntax.

(% r) -> register
($ n) -> immediate
(@ a) -> memory operand

The general memory operand syntax is ([seg] [disp] [base] [index] [scale]), but not all combinations are meaningful. For instance, a [seg] by itself is pretty useless.

Thus one might write: (:@ x8664::misc-data-offset (:%q vec) (:%q word-index) 8)

(:rcontext a) -> memory operand, using segment register or gpr
(:self fn) -> special self-reference, used by the gc on 32-bit x86
s -> a label
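Putting the pieces together, a trivial x86-64 LAP function might look something like the following. (This is a hedged sketch, not code from the CCL sources; it assumes the defx86lapfunction macro, the arg_y/arg_z register names, and the single-value-return convention.)

  (defx86lapfunction %sketch-fixnum-add ((x arg_y) (y arg_z))
    (addq (% x) (% arg_z))   ; fixnum addition works directly on the tagged values
    (single-value-return))   ; the result (a node) is returned in arg_z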

Backend

Writing

(with-imm-target (other-reg) reg ...)

means that we want to assign a register to “reg”, but it can’t be “other-reg”. The vinsn operand type :imm generally means “can hold a fixnum or other lisp immediate type.” In other words, such an operand can be placed in an immediate (unboxed) register.

One can also write

(with-node-target (other-reg) reg ...)

if the register has to be able to hold boxed lisp objects (nodes).

The difference between with-xxx-temps and with-xxx-target is that with-xxx-temps means to find a register and mark it as being in use, i.e., not available for allocation as a temporary.

On the other hand, with-xxx-target means to find a register that is not marked as being in use, and which does not conflict with these other specified registers.

As an example, you might want to say “get the vector, index, and new value into any 3 registers; it doesn't matter which 3, but in general we want them to be distinct from each other.” While getting those 3 values into those registers, we might do some pushing and popping, but we should otherwise be free to allocate temporaries that conflict with those registers as long as things wind up in the right places.

In a few other cases, it’s reasonable to say “mark this as being in use, so that it isn't allocated as a temporary inside a vinsn.” That’s useful in some cases, but a bit more dangerous (in that we can run out of registers through overuse of this fairly easily.)
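A sketch of that first style, using the with-node-target form shown above (illustrative only; it assumes the list of registers to avoid may name zero or more registers, and the body is elided):

  (with-node-target () vec
    (with-node-target (vec) idx
      (with-node-target (vec idx) val
        ;; vec, idx, and val now name three distinct node registers.
        ;; Temporaries allocated while computing into them may conflict
        ;; with them, as long as the final values end up in the right places.
        ...)))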

Implementation Details of Clozure CL

This chapter describes many aspects of OpenMCL's implementation as of (roughly) version 1.1. Details vary a bit between the three architectures (PPC32, PPC64, and x86-64) currently supported and those details change over time, so the definitive reference is the source code (especially some files in the ccl/compiler/ directory whose names contain the string "arch" and some files in the ccl/lisp-kernel/ directory whose names contain the string "constants".) Hopefully, this chapter will make it easier for someone who's interested to read and understand the contents of those files.

Threads and exceptions

Clozure CL's threads are "native" (meaning that they're scheduled and controlled by the operating system.) Most of the implications of this are discussed elsewhere; this section tries to describe how threads look from the lisp kernel's perspective (and especially from the GC's point of view.)

Clozure CL's runtime system tries to use machine-level exception mechanisms (conditional traps when available, illegal instructions, memory access protection in some cases) to detect and handle exceptional situations. These situations include some TYPE-ERRORs and PROGRAM-ERRORS (notably wrong-number-of-args errors), and also include cases like "not being able to allocate memory without GCing or obtaining more memory from the OS." The general idea is that it's usually faster to pay (very occasional) exception-processing overhead and figure out what's going on in an exception handler than it is to maintain enough state and context to handle an exceptional case via a lighter-weight mechanism when that exceptional case (by definition) rarely occurs.

Some emulated execution environments (the Rosetta PPC emulator on x86 versions of Mac OS X) don't provide accurate exception information to exception handling functions. Clozure CL can't run in such environments.

The Thread Context Record

When a lisp thread is first created (or when a thread created by foreign code first calls back to lisp), a data structure called a Thread Context Record (or TCR) is allocated and initialized. On modern versions of Linux and FreeBSD, the allocation actually happens via a set of thread-local-storage ABI extensions, so a thread's TCR is created when the thread is created and dies when the thread dies. (The World's Most Advanced Operating System-as Apple's marketing literature refers to Darwin-is not very advanced in this regard, and I know of no reason to assume that advances will be made in this area anytime soon.)

A TCR contains a few dozen fields (and is therefore a few hundred bytes in size.) The fields are mostly thread-specific information about the thread's stacks' locations and sizes, information about the underlying (POSIX) thread, and information about the thread's dynamic binding history and pending CATCH/UNWIND-PROTECTs. Some of this information could be kept in individual machine registers while the thread is running (and the PPC - which has more registers available - keeps a few things in registers that the X86-64 has to access via the TCR), but it's important to remember that the information is thread-specific and can't (for instance) be kept in a fixed global memory location.

When lisp code is running, the current thread's TCR is kept in a register. On PPC platforms, a general purpose register is used; on x86-64, an (otherwise nearly useless) segment register works well (and prevents the expenditure of a more generally useful general-purpose register for this purpose.)

The address of a TCR is aligned in memory in such a way that a FIXNUM can be used to represent it. The lisp function CCL::%CURRENT-TCR returns the calling thread's TCR as a fixnum; the actual value of the TCR's address is 4 or 8 times the value of this fixnum.
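For example (illustrative; a 64-bit lisp is assumed here, so the multiplier is 8):

  (* (ccl::%current-tcr) 8)   ; the TCR's actual address; multiply by 4 on a 32-bit lisp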

When the lisp kernel initializes a new TCR, it's added to a global list maintained by the kernel; when a thread exits, its TCR is removed from this list.

When a thread calls foreign code, lisp stack pointers are saved in its TCR, lisp registers (at least those whose value should be preserved across the call) are saved on the thread's value stack, and (on x86-64) RSP is switched to the control stack. A field in the TCR (tcr.valence) is then set to indicate that the thread is running foreign code, foreign argument registers are loaded from a frame on the foreign stack, and the foreign function is called. (That's a little oversimplified and possibly inaccurate, but the important things to note are that the thread "stops following lisp stack and register usage conventions" and that it advertises the fact that it's done so.) Similar transitions in a thread's state ("valence") occur when it enters or exits an exception handler (which is sort of an OS/hardware-mandated foreign function call where the OS thoughtfully saves the thread's register state for it beforehand.)

Exception contexts, and exception-handling in general

Unix-like OSes tend to refer to exceptions as "signals"; the same general mechanism ("signal handling") is used to process both asynchronous OS-level events (such as the result of the keyboard driver noticing that ^C or ^Z has been pressed) and synchronous hardware-level events (like trying to execute an illegal instruction or access protected memory.) It makes some sense to defer ("block") handling of asynchronous signals so that some critical code sequences complete without interruption; since it's generally not possible for a thread to proceed after a synchronous exception unless and until its state is modified by an exception handler, it makes no sense to talk about blocking synchronous signals (though some OSes will let you do so and doing so can have mysterious effects.)

On OSX/Darwin, the POSIX signal handling facilities coexist with lower-level Mach-based exception handling facilities. Unfortunately, the way that this is implemented interacts poorly with debugging tools: GDB will generally stop whenever the target program encounters a Mach-level exception and offers no way to proceed from that point (and let the program's POSIX signal handler try to handle the exception); Apple's CrashReporter program has had a similar issue and, depending on how it's configured, may bombard the user with alert dialogs which falsely claim that an application has crashed (when in fact the application in question has routinely handled a routine exception.) On Darwin/OSX, Clozure CL uses Mach thread-level exception handling facilities which run before GDB or CrashReporter get a chance to confuse themselves; Clozure CL's Mach exception handling tries to force the thread which received a synchronous exception to invoke a signal handling function ("as if" signal handling worked more usefully under Darwin.) Mach exception handlers run in a dedicated thread (which basically does nothing but wait for exception messages from the lisp kernel, obtain and modify information about the state of threads in which exceptions have occurred, and reply to the exception messages with an indication that the exception has been handled.) The reply from a thread-level exception handler keeps the exception from being reported to GDB or CrashReporter and avoids the problems related to those programs. Since Clozure CL's Mach exception handler doesn't claim to handle debugging-related exceptions (from breakpoints or single-step operations), it's possible to use GDB to debug Clozure CL.

On platforms where signal handling and debugging don't get in each other's way, a signal handler is entered with all signals blocked. (This behavior is specified in the call to the sigaction() function which established the signal handler.) The signal handler receives three arguments from the OS kernel; the first is an integer that identifies the signal, the second is a pointer to an object of type "siginfo_t", which may or may not contain a few fields that would help to identify the cause of the exception, and the third argument is a pointer to a data structure (called a "ucontext" or something similar), which contains machine-dependent information about the state of the thread at the time that the exception/signal occurred. While asynchronous signals are blocked, the signal handler stores the pointer to its third argument (the "signal context") in a field in the current thread's TCR, sets some bits in another TCR field to indicate that the thread is now waiting to handle an exception, unblocks asynchronous signals, and waits for a global exception lock that serializes exception processing.

On Darwin, the Mach exception thread creates a signal context (and maybe a siginfo_t structure), stores the signal context in the thread's TCR, sets the TCR field which describes the thread's state, and arranges that the thread resume execution at its signal handling function (with a signal number, possibly NULL siginfo_t, and signal context as arguments.) When the thread resumes, it waits for the global exception lock.

On x86-64 platforms where signal handling can be used to handle synchronous exceptions, there's an additional complication: the OS kernel ordinarily allocates the signal context and siginfo structures on the stack of the thread that received the signal; in practice, that means "wherever RSP is pointing." Clozure CL's Register and stack usage conventions require that the thread's value stack-where RSP is usually pointing while lisp code is running-contain only "nodes" (properly tagged lisp objects), and scribbling a signal context all over the value stack would violate this requirement. To maintain consistency, the sigaltstack() mechanism is used to cause the signal to be delivered on (and the signal context and siginfo to be allocated on) a special stack area (the last few pages of the thread's control stack, in practice). When the signal handler runs, it (carefully) copies the signal context and siginfo to the thread's control stack and makes RSP point into that stack before invoking the "real" signal handler. The effect of this hack is that the "real" signal handler always runs on the thread's control stack.

Once the exception handler has obtained the global exception lock, it uses the values of the signal number, siginfo_t, and signal context arguments to determine the (logical) cause of the exception. Some exceptions may be caused by factors that should generate lisp errors or other serious conditions (stack overflow); if this is the case, the kernel code may release the global exception lock and call out to lisp code. (The lisp code in question may need to repeat some of the exception decoding process; in particular, it needs to be able to interpret register values in the signal context that it receives as an argument.)

In some cases, the lisp kernel exception handler may not be able to recover from the exception (this is currently true of some types of memory-access fault, and is also true of traps or illegal instructions that occur during foreign code execution.) In such cases, the kernel exception handler reports the exception as "unhandled", and the kernel debugger is invoked.

If the kernel exception handler identifies the exception's cause as being a transient out-of-memory condition (indicating that the current thread needs more memory to cons in), it tries to make that memory available. In some cases, doing so involves invoking the GC.

Threads, exceptions, and the GC

Clozure CL's GC is not concurrent: when the GC is invoked in response to an exception in a particular thread, all other lisp threads must stop until the GC's work is done. The thread that triggered the GC iterates over the global TCR list, sending each other thread a distinguished "suspend" signal, then iterates over the list again, waiting for a per-thread semaphore that indicates that the thread has received the "suspend" signal and responded appropriately. Once all other threads have acknowledged the request to suspend themselves, the GC thread can run the GC proper (after doing any necessary PC-lusering.) Once the GC's completed its work, the thread that invoked the GC iterates over the global TCR list, raising a per-thread "resume" semaphore for each other thread.

The signal handler for the asynchronous "suspend" signal is entered with all asynchronous signals blocked. It saves its signal-context argument in a TCR slot, raises the TCR's "suspend" semaphore, then waits on the TCR's "resume" semaphore.

The GC thread has access to the signal contexts of all TCRs (including its own) at the time when the thread received an exception or acknowledged a request to suspend itself. This information (and information about stack areas in the TCR itself) allows the GC to identify the "stack locations and register contents" that are elements of the GC's root set.

PC-lusering

It's not quite accurate to say that Clozure CL's compiler and runtime follow precise stack and register usage conventions at all times; there are a few exceptions:

  • On both PPC and x86-64 platforms, consing isn't fully atomic. It takes at least a few instructions to allocate an object in memory (and slap a header on it if necessary); if a thread is interrupted in the middle of that instruction sequence, the new object may or may not have been created or fully initialized at the point in time that the interrupt occurred. (There are actually a few different states of partial initialization.)

  • On the PPC, the common act of building a lisp control stack frame involves allocating a four-word frame and storing three register values into that frame. (The fourth word - the back pointer to the previous frame - is automatically set when the frame is allocated.) The previous contents of those three words are unknown (there might have been a foreign stack frame at the same address a few instructions earlier), so interrupting a thread that's in the process of initializing a PPC control stack frame isn't GC-safe.

  • There are similar problems with the initialization of temp stack frames on the PPC. (Allocation and initialization don't happen atomically, and the newly allocated stack memory may have undefined contents.)

  • The ephemeral GC's write barrier has to be implemented atomically (i.e., both an intergenerational store and the update of the corresponding reference bit have to happen without interruption, or neither of these events can happen.)

  • There are a few more similar cases.

Fortunately, the number of these non-atomic instruction sequences is small, and fortunately it's fairly easy for the interrupting thread to recognize when the interrupted thread is in the middle of such a sequence. When this is detected, the interrupting thread modifies the state of the interrupted thread (modifying its PC and other registers) so that it is no longer in the middle of such a sequence (it's either backed out of it or the remaining instructions are emulated.)

This works because (a) many of the troublesome instruction sequences are PPC-specific and it's relatively easy to partially disassemble the instructions surrounding the interrupted thread's PC on the PPC and (b) those instruction sequences are heavily stylized and intended to be easily recognized.

Register usage and tagging

Overview

Regardless of other details of its implementation, a garbage collector's job is to partition the set of all heap-allocated lisp objects (CONSes, STRINGs, INSTANCEs, etc.) into two subsets. The first subset contains all objects that are transitively referenced from a small set of "root" objects (the contents of the stacks and registers of all active threads at the time the GC occurs and the values of some global variables.) The second subset contains everything else: those lisp objects that are not transitively reachable from the roots are garbage, and the memory occupied by garbage objects can be reclaimed (since the GC has just proven that it's impossible to reference them.)

The set of live, reachable lisp objects basically form the nodes of a (usually large) graph, with edges from each node A to any other objects (nodes) that object A references.

Some nodes in this graph can never have outgoing edges: an array with a specialized numeric or character type usually represents its elements in some (possibly more compact) specialized way. Some nodes may refer to lisp objects that are never allocated in memory (FIXNUMs, CHARACTERs, SINGLE-FLOATs on 64-bit platforms ..) This latter class of objects are sometimes called "immediates", but that's a little confusing because the term "immediate" is sometimes used to refer to things that can never be part of the big connectivity graph (e.g., the "raw" bits that make up a floating-point value, foreign address, or numeric value that needs to be used - at least fleetingly - in compiled code.)

For the GC to be able to build the connectivity graph reliably, it's necessary for it to be able to reliably tell (a) whether or not a "potential root" - the contents of a machine register or stack location - is in fact a node and (b) for any node, whether it may have components that refer to other nodes.

There's no reliable way to answer the first question on stock hardware. (If everything was a node, as might be the case on specially microcoded "lisp machine" hardware, it wouldn't even need to be asked.) Since there's no way to just look at a machine word (the contents of a machine register or stack location) and tell whether or not it's a node or just some random non-node value, we have to either adopt and enforce strict conventions on register and stack usage or tolerate ambiguity.

"Tolerating ambiguity" is an approach taken by some ("conservative") GC schemes; by contrast, Clozure CL's GC is "precise", which in this case means that it believes that the contents of certain machine registers and stack locations are always nodes and that other registers and stack locations are never nodes and that these conventions are never violated by the compiler or runtime system. The fact that threads are preemptively scheduled means that a GC could occur (because of activity in some other thread) on any instruction boundary, which in turn means that the compiler and runtime system must follow precise Register and stack usage conventions at all times.

Once we've decided that a given machine word is a node, a Tagging scheme describes how the node's value and type are encoded in that machine word.

Most of this discussion-so far-has treated things from the GC's very low-level perspective. From a much higher point of view, lisp functions accept nodes as arguments, return nodes as values, and (usually) perform some operations on those arguments in order to produce those results. (In many cases, the operations in question involve raw non-node values.) Higher-level parts of the lisp type system (functions like TYPE-OF and CLASS-OF, etc.) depend on the Tagging scheme.

pc-locatives on the PPC

On the PPC, there's a third case (besides "node" and "immediate" values). As discussed below, a node that denotes a memory-allocated lisp object is a biased (tagged) pointer -to- that object; it's not generally possible to point -into- some composite (multi-element) object (such a pointer would not be a node, and the GC would have no way to update the pointer if it were to move the underlying object.)

Such a pointer ("into" the interior of a heap-allocated object) is often called a locative; the cases where locatives are allowed in Clozure CL mostly involve the behavior of function call and return instructions. (To be technically accurate, the other case also arises on x86-64, but that case isn't as user-visible.)

On the PowerPC (both PPC32 and PPC64), all machine instructions are 32 bits wide and all instruction words are allocated on 32-bit boundaries. In PPC Clozure CL, a CODE-VECTOR is a specialized type of vector-like object; its elements are 32-bit PPC machine instructions. A CODE-VECTOR is an attribute of a FUNCTION object; a function call involves accessing the function's code-vector and jumping to the address of its first instruction.

As each instruction in the code vector sequentially executes, the hardware program counter (PC) register advances to the address of the next instruction (a locative into the code vector); since PPC instructions are always 32 bits wide and aligned on 32-bit boundaries, the low two bits of the PC are always 0. If the function executes a call (simple call instructions have the mnemonic "bl" on the PPC, which stands for "branch and link"), the address of the next instruction (also a word-aligned locative into a code-vector) is copied into the special-purpose PPC "link register" (lr); a function returns to its caller via a "branch to link register" (blr) instruction. Some cases of function call and return might also use the PPC's "count register" (ctr), and if either the lr or ctr needs to be stored in memory it needs to first be copied to a general-purpose register.

Clozure CL's GC understands that certain registers contain these special "pc-locatives" (locatives that point into CODE-VECTOR objects); it contains special support for finding the containing CODE-VECTOR object and for adjusting all of these "pc-locatives" if the containing object is moved in memory. The first part of that operation-finding the containing object-is possible and practical on the PPC because of architectural artifacts (fixed-width instructions and arcana of instruction encoding.) It's not possible on x86-64, but fortunately not necessary either (though the second part - adjusting the PC/RIP when the containing object moves - is both necessary and simple.)

Register and stack usage conventions

Stack conventions

On both PPC and X86 platforms, each lisp thread uses 3 stacks; the ways in which these stacks are used differs between the PPC and X86.

Each thread has:

  • A "control stack". On both platforms, this is "the stack" used by foreign code. On the PPC, it consists of a linked list of frames where the first word in each frame points to the first word in the previous frame (and the outermost frame points to 0.) Some frames on a PPC control stack are lisp frames; lisp frames are always 4 words in size and contain (in addition to the back pointer to the previous frame) the calling function (a node), the return address (a "locative" into the calling function's code-vector), and the value to which the value-stack pointer (see below) should be restored on function exit. On the PPC, the GC has to look at control-stack frames, identify which of those frames are lisp frames, and treat the contents of the saved function slot as a node (and handle the return address locative specially.) On x86-64, the control stack is used for dynamic-extent allocation of immediate objects. Since the control stack never contains nodes on x86-64, the GC ignores it on that platform. Alignment of the control stack follows the ABI conventions of the platform (at least at any point in time where foreign code could run.) On PPC, the r1 register always points to the top of the current thread's control stack; on x86-64, the RSP register points to the top of the current thread's control stack when the thread is running foreign code and the address of the top of the control stack is kept in the thread's TCR (see The Thread Context Record when not running foreign code. The control stack "grows down."

  • A "value stack". On both platforms, all values on the value stack are nodes (including "tagged return addresses" on x86-64.) The value stack is always aligned to the native word size; objects are always pushed on the value stack using atomic instructions ("stwu"/"stdu" on PPC, "push" on x86-64), so the contents of the value stack between its bottom and top are always unambiguously nodes; the compiler usually tries to pop or discard nodes from the value stack as soon as possible after their last use (as soon as they may have become garbage.) On x86-64, the RSP register addresses the top of the value stack when running lisp code; that address is saved in the TCR when running foreign code. On the PPC, a dedicated register (VSP, currently r15) is used to address the top of the value stack when running lisp code, and the VSP value is saved in the TCR when running foreign code. The value stack grows down.

  • A "temp stack". The temp stack consists of a linked list of frames, each of which points to the previous temp stack frame. The number of native machine words in each temp stack frame is always even, so the temp stack is aligned on a two-word (64- or 128-bit) boundary. The temp stack is used for dynamic-extent objects on both platforms; on the PPC, it's used for essentially all such objects (regardless of whether or not the objects contain nodes); on the x86-64, immediate dynamic-extent objects (strings, foreign pointers, etc.) are allocated on the control stack and only node-containing dynamic-extent objects are allocated on the temp stack. Data structures used to implement CATCH and UNWIND-PROTECT are stored on the temp stack on both ppc and x86-64. Temp stack frames are always doublenode aligned and objects within a temp stack frame are aligned on doublenode boundaries. The first word in each frame contains a back pointer to the previous frame; on the PPC, the second word is used to indicate to the GC whether the remaining objects are nodes (if the second word is 0) or immediate (otherwise.) On x86-64, where temp stack frames always contain nodes, the second word is always 0. The temp stack grows down. It usually takes several instructions to allocate and safely initialize a temp stack frame that's intended to contain nodes, and the GC has to recognize the case where a thread is in the process of allocating and initializing a temp stack frame and take care not to interpret any uninitialized words in the frame as nodes. The PPC keeps the current top of the temp stack in a dedicated register (TSP, currently r12) when running lisp code and saves this register's value in the TCR when running foreign code. The x86-64 keeps the address of the top of each thread's temp stack in the thread's TCR.

Register conventions

If there are a "reasonable" (for some value of "reasonable") number of general-purpose registers and the instruction set is "reasonably" orthogonal (most instructions that operate on GPRs can operate on any GPR), then it's possible to statically partition the GPRs into at least two sets: "immediate registers" never contain nodes, and "node registers" always contain nodes. (On the PPC, a few registers are members of a third set of "PC locatives", and on both platforms some registers may have dedicated roles as stack or heap pointers; the latter class is treated as immediates by the GC proper but may be used to help determine the bounds of stack and heap memory areas.)

The ultimate definition of register partitioning is hardwired into the GC in functions like "mark_xp()" and "forward_xp()", which process the values of some of the registers in an exception frame as nodes and may give some sort of special treatment to other register values they encounter there.

On x86-64, the static register partitioning scheme involves:

  • (only) three "immediate" registers.

    The RAX, RCX, and RDX registers are used as the implicit operands and results of some extended-precision multiply and divide instructions which generally involve non-node values; since their use in these instructions means that they can't be guaranteed to contain node values at all times, it's natural to put these registers in the "immediate" set. RAX is generally given the symbolic name "imm0", RDX is given the symbolic name "imm1" and RCX is given the symbolic name "imm2"; you may see these names in disassembled code, usually in operations involving type checking, array indexing, and foreign memory and function access.

  • (only) two "dedicated" registers.

    RSP and RBP have dedicated functionality dictated by the hardware and calling conventions.

  • 11 "node" registers.

    All other registers (RBX, RSI, RDI, and R8-R15) are asserted to contain node values at (almost) all times; legacy "string" operations that implicitly use RSI and/or RDI are not used.

On 32-bit x86, the default register partitioning scheme involves:

  • A single "immediate" register.

    The EAX register is given the symbolic name "imm0".

  • There are two "dedicated" registers.

    ESP and EBP have dedicated functionality dictated by the hardware and calling conventions.

  • 5 "node" registers.

    The remaining registers (EBX, ECX, EDX, ESI, EDI) normally contain node values. As on x86-64, string instructions that implicitly use ESI and EDI are not used.

There are times when this default partitioning scheme is inadequate. As mentioned in the x86-64 section, there are instructions like the extended-precision MUL and DIV which require the use of EAX and EDX. We therefore need a way to change this partitioning at run-time.

Two schemes are employed. The first uses a mask in the TCR that contains a bit for each register. If the bit is set, the register is interpreted by the GC as a node register; if it's clear, the register is treated as an immediate register. The second scheme uses the direction flag in the EFLAGS register. If DF is set, EDX is treated as an immediate register. (We don't use the string instructions, so DF isn't otherwise used.)

On the PPC, the static register partitioning scheme involves:

  • 6 "immediate" registers.

    Registers r3-r8 are given the symbolic names imm0-imm5. As a RISC architecture with simpler addressing modes, the PPC probably uses immediate registers a bit more often than the CISC x86-64 does, but they're generally used for the same sort of things (type checking, array indexing, FFI, etc.)

  • 9 dedicated registers

    • r0 (symbolic name rzero) always contains the value 0 when running lisp code. Its value is sometimes read as 0 when it's used as the base register in a memory address; keeping the value 0 there is sometimes convenient and avoids asymmetry.

    • r1 (symbolic name sp) is the control stack pointer, by PPC convention.

    • r2 is used to hold the current thread's TCR on ppc64 systems; it's not used on ppc32.

    • r9 and r10 (symbolic names allocptr and allocbase) are used to do per-thread memory allocation

    • r11 (symbolic name nargs) contains the number of function arguments on entry and the number of return values in multiple-value returning constructs. It's not used more generally as either a node or immediate register because of the way that certain trap instruction encodings are interpreted.

    • r12 (symbolic name tsp) holds the top of the current thread's temp stack.

    • r13 is used to hold the TCR on PPC32 systems; it's not used on PPC64.

    • r14 (symbolic name loc-pc) is used to copy "pc-locative" values between main memory and special-purpose PPC registers (LR and CTR) used in function-call and return instructions.

    • r15 (symbolic name vsp) addresses the top of the current thread's value stack.

    • lr and ctr are PPC branch-unit registers used in function call and return instructions; they're always treated as "pc-locatives", which precludes the use of the ctr in some PPC looping constructs.

  • 17 "node" registers

    r15-r31 are always treated as node registers

Tagging scheme

Clozure CL always allocates lisp objects on double-node (64-bit for 32-bit platforms, 128-bit for 64-bit platforms) boundaries; this means that the low 3 bits (32-bit lisp) or 4 bits (64-bit lisp) of an object's address are always 0 and are therefore redundant (we only really need to know the upper 29 or 60 bits in order to identify the aligned object address.) The extra bits in a lisp node can be used to encode at least some information about the node's type, and the other 29/60 bits represent either an immediate value or a doublenode-aligned memory address. The low 3 or 4 bits of a node are called the node's "tag bits", and the conventions used to encode type information in those tag bits are called a "tagging scheme."

It might be possible to use the same tagging scheme on all platforms (at least on all platforms with the same word size and/or the same number of available tag bits), but there are often some strong reasons for not doing so. These arguments tend to be very machine-specific: sometimes, there are fairly obvious machine-dependent tricks that can be exploited to make common operations on some types of tagged objects faster; other times, there are architectural restrictions that make it impractical to use certain tags for certain types. (On PPC64, the "ld" (load doubleword) and "std" (store doubleword) instructions - which load and store a GPR operand at the effective address formed by adding the value of another GPR operand and a 16-bit constant operand - require that the low two bits of that constant operand be 0. Since such instructions would typically be used to access the fields of things like CONS cells and structures, it's desirable that the tags chosen for CONS cells and structures allow the use of these instructions as opposed to more expensive alternatives.)

One architecture-dependent tagging trick that works well on all architectures is to use a tag of 0 for FIXNUMs: a fixnum basically encodes its value shifted left a few bits and keeps those low bits clear. FIXNUM addition, subtraction, and binary logical operations can operate directly on the node operands, addition and subtraction can exploit hardware-based overflow detection, and (in the absence of overflow) the hardware result of those operations is a node (fixnum). Some other slightly-less-common operations may require a few extra instructions, but arithmetic operations on FIXNUMs should be as cheap as possible and using a tag of zero for FIXNUMs helps to ensure that it will be.

If we have N available tag bits (N = 3 for 32-bit Clozure CL and N = 4 for 64-bit Clozure CL), this way of representing fixnums with the low M bits forced to 0 works as long as M <= N. The smaller we make M, the larger the values of MOST-POSITIVE-FIXNUM and MOST-NEGATIVE-FIXNUM become; the larger we make M, the more distinct non-FIXNUM tags become available. A reasonable compromise is to choose M = N-1; this basically yields two distinct FIXNUM tags (one for even fixnums, one for odd fixnums), gives 30-bit fixnums on 32-bit platforms and 61-bit fixnums on 64-bit platforms, and leaves us with 6 or 14 tags to encode other types.
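These numbers are easy to check from a running lisp; the values below are what a 64-bit Clozure CL reports (a 32-bit lisp reports 29 and 2^29 - 1, respectively):

  (integer-length most-positive-fixnum)   ; => 60, i.e. 61-bit signed fixnums
  most-positive-fixnum                    ; => 1152921504606846975 = 2^60 - 1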

Once we get past the assignment of FIXNUM tags, things quickly devolve into machine-dependencies. We can fairly easily see that we can't directly tag all other primitive lisp object types with only 6 or 14 available tag values; the details of how types are encoded vary between the ppc32, ppc64, and x86-64 implementations, but there are some general common principles:

  • CONS cells always contain exactly 2 elements and are usually fairly common. It therefore makes sense to give CONS cells their own tag. Unlike the fixnum case - where a tag value of 0 had positive implications - there doesn't seem to be any advantage to using any particular value. (A long time ago - in the case of 68K MCL - the CONS tag and the order of CAR and CDR in memory were chosen to allow smaller, cheaper addressing modes to be used to "cdr down a list." That's not a factor on ppc or x86-64, but all versions of Clozure CL still store the CDR of a CONS cell first in memory. It doesn't matter, but doing it the way that the host system did made bootstrapping to a new target system a little easier.)

  • Any way you look at it, NIL is a bit ... unusual. NIL is both a SYMBOL and a LIST (as well as being a canonical truth value and probably a few other things.) Its role as a LIST is probably much more important to most programs than its role as a SYMBOL is: LISTP has to be true of NIL and primitives like CAR and CDR do LISTP implicitly when safe and want that operation to be fast. There are several possible approaches to this problem; Clozure CL uses two of them. On PPC32 and X86-64, NIL is basically a weird CONS cell that straddles two doublenodes; the tag of NIL is unique and congruent modulo 4 (modulo 8 on 64-bit) with the tag used for CONS cells. LISTP is therefore true of any node whose low 2 (or 3) bits contain the appropriate tag value (it's not otherwise necessary to special-case NIL.) SYMBOL accessors (SYMBOL-NAME, SYMBOL-VALUE, SYMBOL-PLIST ..) -do- have to special-case NIL (and access the components of an internal proxy symbol.) On PPC64 (where architectural restrictions dictate the set of tags that can be used to access fixed components of an object), that approach wasn't practical. NIL is just a distinguished SYMBOL, and it just happens to be the case that its pname slot and values slot are at the same offsets from a tagged pointer as a CONS cell's CDR and CAR would be. NIL's pname is set to NIL (SYMBOL-NAME checks for this and returns the string "NIL"), and LISTP (and therefore safe CAR and CDR) has to check for (OR NULL CONSP). At least in the case of CAR and CDR, the fact that the PPC has multiple condition-code fields keeps that extra test from being prohibitively expensive. On IA-32, we can't afford to dedicate a tag to NIL. NIL is therefore just a distinguished CONS cell, and we have to explicitly check for a NIL argument in CONSP/RPLACA/RPLACD.

  • Some objects are immediate (but not FIXNUMs). This is true of CHARACTERs and, on 64-bit platforms, SINGLE-FLOATs. It's also true of some nodes used in the runtime system (special values used to indicate unbound variables and slots, for instance.) On 64-bit platforms, SINGLE-FLOATs have their own unique tag (making them a little easier to recognize); on all platforms, CHARACTERs share a tag with other immediate objects (unbound markers) but are easy to recognize (by looking at several of their low bits.) The GC treats any node with an immediate tag (and any node with a fixnum tag) as a leaf. (A short illustration follows this list.)

  • There are some advantages to treating everything else-memory-allocated objects that aren't CONS cells-uniformly. There are some disadvantages to that uniform treatment as well, and the treatment of "memory-allocated non-CONS objects" isn't entirely uniform across all Clozure CL implementations. Let's first pretend that the treatment is uniform, then discuss the ways in which it isn't. The "uniform approach" is to treat all memory-allocated non-CONS objects as if they were vectors; this use of the term is a little looser than what's implied by the CL VECTOR type. Clozure CL actually uses the term "uvector" to mean "a memory-allocated lisp object other than a CONS cell, whose first word is a header that describes the object's type and the number of elements that it contains." In this view, a SYMBOL is a UVECTOR, as is a STRING, a STANDARD-INSTANCE, a CL array or vector, a FUNCTION, and even a DOUBLE-FLOAT. In the PPC implementations (where things are a little more ... uniform), a single tag value is used to denote any uvector; in order to determine something more specific about the type of the object in question, it's necessary to fetch the low byte of the header word from memory. On the x86-64 platform, certain types of uvectors - SYMBOLs and FUNCTIONs - are given their own unique tags. The good news about the x86-64 approach is that SYMBOLs and FUNCTIONs can be recognized without referencing memory; the slightly bad news is that primitive operations that work on UVECTOR-tagged objects-like the function CCL:UVREF-don't work on SYMBOLs or FUNCTIONs on x86-64 (but -do- work on those types of objects in the PPC ports.) The header word that precedes a UVECTOR's data in memory contains 8 bits of type information in the low byte and either 24 or 56 bits of "element-count" information in the rest of the word. (This is where the sometimes-limiting value of 2^24 for ARRAY-TOTAL-SIZE-LIMIT on 32-bit platforms comes from.) The low byte of the header-sometimes called the uvector's subtag-is itself tagged (which means that the header is tagged.) The (3 or 4) tag bits in the subtag are used to determine whether the uvector's elements are nodes or immediates. (A UVECTOR whose elements are nodes is called a GVECTOR; a UVECTOR whose elements are immediates is called an IVECTOR. This terminology came from Spice Lisp, which was a predecessor of CMUCL.) Even though a uvector header is tagged, a header is not a node. There's no (supported) way to get your hands on one in lisp and doing so could be dangerous. (If the value of a header wound up in a lisp node register and that register wound up getting pushed on a thread's value stack, the GC might misinterpret that situation to mean that there was a stack-allocated UVECTOR on the value stack.)
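One consequence of the treatment of immediates is easy to observe from the listener (illustrative of Clozure CL's behavior, not something the CL standard guarantees):

  (eq #\a #\a)       ; => T: characters are immediate on all platforms
  (eq 1.5f0 1.5f0)   ; => T on 64-bit CCL, where SINGLE-FLOATs are immediate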

Heap Allocation

When the Clozure CL kernel first starts up, a large contiguous chunk of the process's address space is mapped as "anonymous, no access" memory. ("Large" means different things in different contexts; on LinuxPPC32, it means "about 1 gigabyte", on DarwinPPC32, it means "about 2 gigabytes", and on current 64-bit platforms it ranges from 128 to 512 gigabytes, depending on OS. These values are both defaults and upper limits; the --heap-reserve argument can be used to try to reserve less than the default.)

Reserving address space that can't (yet) be read or written to doesn't cost much; in particular, it doesn't require that corresponding swap space or physical memory be available. Marking the address range as being "mapped" helps to ensure that other things (results from random calls to malloc(), dynamically loaded shared libraries) won't be allocated in this region that lisp has reserved for its own heap growth.

A small portion (around 1/32 on 32-bit platforms and 1/64 on 64-bit platforms) of that large chunk of address space is reserved for GC data structures. Memory pages reserved for these data structures are mapped read-write as pages are made writable in the main portion of the heap.

The initial heap image is mapped into this reserved address space and an additional (LISP-HEAP-GC-THRESHOLD) bytes are mapped read-write. GC data structures grow to match the amount of GC-able memory in the initial image plus the gc threshold, and control is transferred to lisp code. Inevitably, that code spoils everything and starts consing; there are basically three layers of memory allocation that can go on.

Per-thread object allocation

Each lisp thread has a private "reserved memory segment"; when a thread starts up, its reserved memory segment is empty. PPC ports maintain the highest unallocated address and the lowest allocatable address in the current segment in registers when running lisp code; on x86-64, these values are maintained in the current thread's TCR. (An "empty" heap segment is one whose high pointer and low pointer are equal.) When a thread is not in the middle of allocating something, the low 3 or 4 bits of the high and low pointers are clear (the pointers are doublenode-aligned.)

A thread tries to allocate an object whose physical size in bytes is X and whose tag is Y by:

  1. decrementing the "high" pointer by (- X Y)

  2. trapping if the high pointer is less than the low pointer

  3. using the (tagged) high pointer to initialize the object, if necessary

  4. clearing the low bits of the high pointer

On PPC32, where the size of a CONS cell is 8 bytes and the tag of a CONS cell is 1, machine code which sets the arg_z register to the result of doing (CONS arg_y arg_z) looks like:


  (SUBI ALLOCPTR ALLOCPTR 7)    ; decrement the high pointer by (- 8 1)
  (TWLLT ALLOCPTR ALLOCBASE)    ; trap if the high pointer is below the base
  (STW ARG_Z -1 ALLOCPTR)       ; set the CDR of the tagged high pointer
  (STW ARG_Y 3 ALLOCPTR)        ; set the CAR
  (MR ARG_Z ALLOCPTR)           ; arg_z is the new CONS cell
  (RLWINM ALLOCPTR ALLOCPTR 0 0 28)     ; clear tag bits
	    

On x86-64, the idea's similar but the implementation is different. The high and low pointers to the current thread's reserved segment are kept in the TCR, which is addressed by the gs segment register. An x86-64 CONS cell is 16 bytes wide and has a tag of 3; we canonically use the temp0 register to initialize the object:


  (subq ($ 13) ((% gs) 216))    ; decrement allocptr
  (movq ((% gs) 216) (% temp0)) ; load allocptr into temp0
  (cmpq ((% gs) 224) (% temp0)) ; compare to allocbase
  (jg L1)                       ; skip trap
  (uuo-alloc)                   ; uh, don't skip trap
L1
  (andb ($ 240) ((% gs) 216))   ; untag allocptr in the tcr
  (movq (% arg_y) (5 (% temp0))) ; set the car
  (movq (% arg_z) (-3 (% temp0))); set the cdr
  (movq (% temp0) (% arg_z))    ; return the cons
	    

If we don't take the trap (if allocating 8-16 bytes doesn't exhaust the thread's reserved memory segment), that's a fairly short and simple instruction sequence. If we do take the trap, we'll have to do some additional work in order to get a new segment for the current thread.

Allocation of reserved heap segments

After the lisp image is first mapped into memory - and after each full GC - the lisp kernel ensures that (LISP-HEAP-GC-THRESHOLD) additional bytes beyond the current end of the heap are mapped read-write.

If a thread traps while trying to allocate memory, the thread goes through the usual exception-handling protocol (to ensure that any other thread that GCs "sees" the state of the trapping thread and to serialize exception handling.) When the exception handler runs, it determines the nature and size of the failed allocation and tries to complete the allocation on the thread's behalf (and leave it with a reasonably large thread-specific memory segment, so that the next small allocation is unlikely to trap.)

Depending on the size of the requested segment allocation, the number of segment allocations that have occurred since the last GC, and the EGC and GC thresholds, the segment allocation trap handler may invoke a full or ephemeral GC before returning a new segment. It's worth noting that the [E]GC is triggered based on the number and size of these segments that have been allocated since the last GC; it doesn't have much to do with how "full" each of those per-thread segments is. It's possible for a large number of threads to do fairly incidental memory allocation and trigger the GC as a result; avoiding this involves tuning the per-thread allocation quantum and the GC/EGC thresholds appropriately.
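The thresholds mentioned here can be inspected and adjusted from lisp; a few illustrative calls (these operators are part of CCL's GC interface, though the exact defaults vary by platform and release):

  (ccl:lisp-heap-gc-threshold)                  ; current full-GC threshold, in bytes
  (ccl:set-lisp-heap-gc-threshold (ash 64 20))  ; request a 64MB threshold
  (ccl:use-lisp-heap-gc-threshold)              ; grow the heap to the new threshold now
  (ccl:configure-egc 512 1024 2048)             ; per-generation EGC thresholds, in kilobytes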

Heap growth

All OSes on which Clozure CL currently runs use an "overcommit" memory allocation strategy by default (though some of them provide ways of overriding that default.) What this means in general is that the OS doesn't necessarily ensure that backing store is available when asked to map pages as read-write; it'll often return a success indicator from the mapping attempt (mapping the pages as "zero-fill, copy-on-write"), and only try to allocate the backing store (swap space and/or physical memory) when non-zero contents are written to the pages.

It -sounds- like it'd be better to have the mmap() call fail immediately, but it's actually a complicated issue. (It's possible that other applications will stop using some backing store before lisp code actually touches the pages that need it, for instance.) It's also not guaranteed that lisp code would be able to "cleanly" signal an out-of-memory condition if lisp is ... out of memory.

I don't know that I've ever seen an abrupt out-of-memory failure that wasn't preceded by several minutes of excessive paging activity. The most expedient course in cases like this is to either (a) use less memory or (b) get more memory; it's generally hard to use memory that you don't have.

GC details

The GC uses a Mark/Compact algorithm; its execution time is essentially a function of the amount of live data in the heap. (The somewhat better-known Mark/Sweep algorithms don't compact the live data but instead traverse the garbage to rebuild free-lists; their execution time is therefore a function of the total heap size.)

As mentioned in Heap Allocation, two auxiliary data structures (proportional to the size of the lisp heap) are maintained. These are

  1. the markbits bitvector, which contains a bit for every doublenode in the dynamic heap (plus a few extra words for alignment and so that sub-bitvectors can start on word boundaries.)

  2. the relocation table, which contains a native word for every 32 or 64 doublenodes in the dynamic heap, plus an extra word used to keep track of the end of the heap.

The total GC space overhead is therefore on the order of 3% (2/64 or 1/32).

The general algorithm proceeds as follows:

Mark phase

Each doublenode in the dynamic heap has a corresponding bit in the markbits vector. (For any doublenode in the heap, the index of its mark bit is determined by subtracting the address of the start of the heap from the address of the object and dividing the result by 8 or 16.) The GC knows the markbit index of the free pointer, so determining that the markbit index of a doubleword address is between the start of the heap and the free pointer can be done with a single unsigned comparison.
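In plain lisp terms, that index computation is just the following (a hypothetical sketch; the real code is C in the lisp kernel):

  (defun markbit-index (object-address heap-base doublenode-size)
    ;; DOUBLENODE-SIZE is 8 on 32-bit platforms and 16 on 64-bit platforms.
    (values (floor (- object-address heap-base) doublenode-size)))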

The markbits of all doublenodes in the dynamic heap are zeroed before the mark phase begins. An object is marked if the markbits of all of its constituent doublewords are set and unmarked otherwise; setting an object's markbits involves setting the corresponding markbits of all constituent doublenodes in the object.

The mark phase traverses each root. If the tag of the value of the root indicates that it's a non-immediate node whose address lies in the lisp heap, then:

  1. If the object is already marked, do nothing.

  2. Set the object's markbit(s).

  3. If the object is an ivector, do nothing further.

  4. If the object is a cons cell, recursively mark its car and cdr.

  5. Otherwise, the object is a gvector. Recursively mark its elements.

Marking an object thus involves ensuring that its mark bits are set and then recursively marking any pointers contained within the object if the object was originally unmarked. If this recursive step was implemented in the obvious manner, marking an object would take stack space proportional to the length of the pointer chain from some root to that object. Rather than storing that pointer chain implicitly on the stack (in a series of recursive calls to the mark subroutine), the Clozure CL marker uses a mixture of recursion and a technique called link inversion to store the pointer chain in the objects themselves. (Recursion tends to be simpler and faster; if a recursive step notes that stack space is becoming limited, the link-inversion technique is used.)

Certain types of objects are treated a little specially:

  1. To support a feature called GCTWA (an acronym that I believe comes from MACLISP, where it stood for "Garbage Collection of Truly Worthless Atoms"), the vector that contains the internal symbols of the current package is marked on entry to the mark phase, but the symbols themselves are not marked at this time. Near the end of the mark phase, symbols referenced from this vector which are not otherwise marked are marked if and only if they're somehow distinguishable from newly created symbols (by virtue of their having function bindings, value bindings, plists, or other attributes.)

  2. Pools have their first element set to NIL before any other elements are marked.

  3. All hash tables have certain fields (used to cache previous results) invalidated.

  4. Weak Hash Tables and other weak objects are put on a linked list as they're encountered; their contents are only retained if there are other (non-weak) references to them.

At the end of the mark phase, the markbits of all objects that are transitively reachable from the roots are set and all other markbits are clear.

Relocation phase

The forwarding address of a doublenode in the dynamic heap is (<its current address> - (size_of_doublenode * <the number of unmarked markbits that precede it>)) or alternately (<the base of the heap> + (size_of_doublenode * <the number of marked markbits that precede it >)). Rather than count the number of preceding markbits each time, the relocation table is used to precompute an approximation of the forwarding addresses for all doublewords. Given this approximate address and a pointer into the markbits vector, it's relatively easy to compute the exact forwarding address.

The relocation table contains the forwarding addresses of each pagelet, where a pagelet is 256 bytes (or 32 doublenodes). The forwarding address of the first pagelet is the base of the heap. The forwarding address of the second pagelet is the sum of the forwarding address of the first and 8 bytes for each mark bit set in the first 32-bit word in the markbits table. The last entry in the relocation table contains the forwarding address that the freepointer would have, i.e., the new value of the freepointer after compaction.
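A lisp rendering of that table-building pass might look like this (a hypothetical sketch using the 32-bit numbers from the paragraph above; the real code is C in the lisp kernel):

  (defun build-relocation-table (heap-base markbit-words)
    ;; MARKBIT-WORDS: a vector of 32-bit words, one bit per 8-byte doublenode.
    (let ((table (make-array (1+ (length markbit-words))))
          (addr heap-base))
      (dotimes (i (length markbit-words))
        (setf (aref table i) addr)                       ; forwarding address of pagelet i
        (incf addr (* 8 (logcount (aref markbit-words i)))))
      (setf (aref table (length markbit-words)) addr)    ; the forwarded free pointer
      table))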

In many programs, old objects rarely become garbage and new objects often do. When building the relocation table, the relocation phase notes the address of the first unmarked object in the dynamic heap. Only the area of the heap between the first unmarked object and the freepointer needs to be compacted; only pointers to this area will need to be forwarded (the forwarding address of any other pointer into the dynamic heap is simply its current value.) Often, the first unmarked object is much nearer the free pointer than it is to the base of the heap.

Forwarding phase

The forwarding phase traverses all roots and the "old" part of the dynamic heap (the part between the base of the heap and the first unmarked object.) All references to objects whose address is between the first unmarked object and the free pointer are updated to point to the address the object will have after compaction by using the relocation table and the markbits vector and interpolating.

The relocation table entry for the pagelet nearest the object is found. If the pagelet's address is less than the object's address, the number of set markbits that precede the object on the pagelet is used to determine the object's forwarding address; otherwise, the number of set markbits that follow the object on the pagelet is used.
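A hedged sketch of that interpolation, for the simpler case where the pagelet's base address is at or below the object's, follows. It shares the 32-bit assumptions of the earlier sketches, plus one more: that bit N of a pagelet's markbit word corresponds to the Nth doublenode on that pagelet (the kernel's actual bit ordering may differ).

(defun interpolated-forwarding-address (address heap-base table markbit-words)
  (let* ((dnode-index (floor (- address heap-base) 8))
         (pagelet (floor dnode-index 32))
         (bit (mod dnode-index 32))
         ;; marked doublenodes on this pagelet that precede ADDRESS,
         ;; under the assumed bit ordering
         (preceding (logcount (ldb (byte bit 0)
                                   (aref markbit-words pagelet)))))
    (+ (aref table pagelet) (* 8 preceding))))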

Since forwarding views the heap as a set of doublewords, locatives are (mostly) treated like any other pointers. (The basic difference is that locatives may appear to be tagged as fixnums, in which case they're treated as word-aligned pointers into the object.)

If the forwarding phase changes the address of any key in a hash table that hashes by address (e.g., an EQ hash table), it sets a bit in the hash table's header. The hash table code will rehash the table's contents the next time it tries to look up a key in such a table.

Profiling reveals that about half of the total time spent in the GC is spent in the subroutine which determines a pointer's forwarding address. Exploiting GCC-specific idioms, hand-coding the routine, and inlining calls to it could all be expected to improve GC performance.

Compact phase

The compact phase compacts the area between the first unmarked object and the freepointer so that it contains only marked objects. While doing so, it forwards any pointers it finds in the objects it copies.

When the compact phase is finished, so is the GC (more or less): the free pointer and some other data structures are updated and control returns to the exception handler that invoked the GC. If sufficient memory has been freed to satisfy any allocation request that may have triggered the GC, the exception handler returns; otherwise, a "seriously low on memory" condition is signaled, possibly after releasing a small emergency pool of memory.

The ephemeral GC

In the Clozure CL memory management scheme, the relative age of two objects in the dynamic heap can be determined by their addresses: if addresses X and Y are both addresses in the dynamic heap, X is younger than Y (X was created more recently than Y) if it is nearer to the free pointer (and farther from the base of the heap) than Y.

Ephemeral (or generational) garbage collectors attempt to exploit the following assumptions:

  • most newly created objects become garbage soon after they're created.

  • most objects that have already survived several GCs are unlikely to ever become garbage.

  • old objects can only point to newer objects as the result of a destructive modification (e.g., via SETF.)

By concentrating its efforts on (frequently and quickly) reclaiming newly created garbage, an ephemeral collector hopes to postpone the more costly full GC as long as possible. It's important to note that most programs create some long-lived garbage, so an EGC can't typically eliminate the need for full GC.

An EGC views each object in the heap as belonging to exactly one generation; generations are sets of objects that are related to each other by age: some generation is the youngest, some the oldest, and there's an age relationship between any intervening generations. Objects are typically assigned to the youngest generation when first allocated; any object that has survived some number of GCs in its current generation is promoted (or tenured) into an older generation.

When a generation is GCed, the roots consist of the stacks, registers, and global variables, as always, along with any pointers to objects in that generation from other generations. To avoid the need to scan those (often large) other generations looking for such intergenerational references, the runtime system must note all such references at the point where they're created (via SETF). (This is sometimes called "the write barrier": all assignments which might result in intergenerational references must be noted, as if the other generations were write-protected.) The set of pointers that may contain intergenerational references is sometimes called the remembered set.

In Clozure CL's EGC, the heap is organized exactly the same as otherwise; "generations" are merely structures which contain pointers to regions of the heap (which is already ordered by age.) When a generation needs to be GCed, any younger generation is incorporated into it; all objects which survive a GC of a given generation are promoted into the next older generation. The only intergenerational references that can exist are therefore those where an old object is modified to contain a pointer to a new object.

The EGC uses exactly the same code as the full GC. When a given GC is "ephemeral",

  • the "base of the heap" used to determine an object's markbit address is the base of the generation being collected;

  • the markbits vector is actually a pointer into the middle of the global markbits table; preceding entries in this table are used to note doubleword addresses in older generations that (may) contain intergenerational references;

  • some steps (notably GCTWA and the handling of weak objects) are not performed;

  • the intergenerational references table is used to find additional roots for the mark and forward phases. If a bit is set in the intergenerational references table, that means that the corresponding doubleword (in some "old" generation, in some "earlier" part of the heap) may have had a pointer to an object in a younger generation stored into it.

With one exception (the implicit setfs that occur on entry to and exit from the binding of a special variable), all setfs that might introduce an intergenerational reference must be memoized. Note that the implicit setfs that occur when initializing an object - as in the case of a call to cons or vector - can't introduce intergenerational references, since the newly created object is always younger than the objects used to initialize it. It's always safe to push any cons cell or gvector locative onto the memo stack; it's never safe to push anything else.
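Conceptually, the memoization amounts to something like the following sketch. It is illustrative only; the actual work is done by the SETF machinery and the kernel, OLD-OBJECT-P is a hypothetical predicate meaning "allocated in an older generation," and the (object . index) pair stands in for the real locative.

(defvar *memo-stack*
  (make-array 0 :adjustable t :fill-pointer 0)
  "Hypothetical stand-in for the real memoization buffer.")

(defun barriered-store (gvector index new-value)
  ;; GVECTOR is modeled as a simple vector here
  (when (old-object-p gvector)
    ;; remember the stored-into location so the EGC can treat it as a
    ;; potential root on the next ephemeral collection
    (vector-push-extend (cons gvector index) *memo-stack*))
  (setf (svref gvector index) new-value))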

Typically, the intergenerational references bitvector is sparse: a relatively small number of old locations are stored into, although some of them may have been stored into many times. The routine that scans the memoization buffer does a lot of work and usually does it fairly often; it uses a simple, brute-force method, but might run faster if it were smarter about recognizing addresses that it'd already seen.

When the EGC mark and forward phases scan the intergenerational reference bits, they can clear any bits that denote doublewords that definitely do not contain intergenerational references.

Fasl files

Saving and loading of Fasl files is implemented in xdump/faslenv.lisp, level-0/nfasload.lisp, and lib/nfcomp.lisp. The information here is only an overview, which might help when reading the source.

The Clozure CL Fasl format is forked from the old MCL Fasl format; there are a few differences, but they are minor. The name "nfasload" comes from the fact that this is the so-called "new" Fasl system, which was true in 1986 or so.

A Fasl file begins with a "file header", which contains version information and a count of the following "blocks". There's typically only one "block" per Fasl file. The blocks are part of a mechanism for combining multiple logical files into a single physical file, in order to simplify the distribution of precompiled programs.

Each block begins with a header for itself, which just describes the size of the data that follows.

The data in each block is treated as a simple stream of bytes, which define a bytecode program. The actual bytecodes, "fasl operators", are defined in xdump/faslenv.lisp. The descriptions in the source file are terse, but, according to Gary, "probably accurate".

Some of the operators are used to create a per-block "object table", which is a vector used to keep track of previously-loaded objects and simplify references to them. When the table is created, an index associated with it is set to zero; this is analogous to an array fill-pointer, and allows the table to be treated like a stack.

The low seven bits of each bytecode are used to specify the fasl operator; currently, about fifty operators are defined. The high bit, when set, indicates that the result of the operation should be pushed onto the object table.

Most bytecodes are followed by operands; the operand data is byte-aligned. How many operands there are, and their type, depend on the bytecode. Operands can be indices into the object table, immediate values, or some combination of these.

An exception is the bytecode #xFF, which has the symbolic name ccl::$faslend; it is used to mark the end of the block.
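As a rough illustration of how a block's byte stream is consumed, consider the sketch below. The real reader lives in level-0/nfasload.lisp; here STREAM is an octet stream positioned at a block's data, OBJECT-TABLE is an adjustable vector with a fill pointer, and DISPATCH-FASL-OPERATOR is a hypothetical stand-in for the per-operator code.

(defun process-fasl-block (stream object-table)
  (loop for byte = (read-byte stream)
        until (= byte #xFF)                         ; ccl::$faslend
        do (let* ((operator (ldb (byte 7 0) byte))  ; low seven bits
                  (push-result-p (logbitp 7 byte))  ; high bit
                  ;; the per-operator code also reads whatever operands
                  ;; the operator requires
                  (result (dispatch-fasl-operator operator stream
                                                  object-table)))
             (when push-result-p
               (vector-push-extend result object-table)))))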

The Objective-C Bridge

How Clozure CL Recognizes Objective-C Objects

In most cases, pointers to instances of Objective-C classes are recognized as such; the recognition is (and probably always will be) slightly heuristic. Basically, any pointer that passes basic sanity checks and whose first word is a pointer to a known ObjC class is considered to be an instance of that class; the Objective-C runtime system would reach the same conclusion.

It's certainly possible that a random pointer to an arbitrary memory address could look enough like an ObjC instance to fool the lisp runtime system, and it's possible for a pointer's contents to change so that something that had been a true ObjC instance (or had looked a lot like one) no longer is one (possibly by virtue of having been deallocated.)

In the first case, we can improve the heuristics substantially: we can make stronger assertions that a particular pointer is really "of type :ID" when it's an argument to a function declared to take such a pointer, or the result of a function declared to return one; we can be more confident of something we obtained via SLOT-VALUE of a slot defined to be of type :ID than if we just dug a pointer out of memory somewhere.

The second case is a little more subtle: ObjC memory management is based on a reference-counting scheme, and it's possible for an object to ... cease to be an object while lisp is still referencing it. If we don't want to deal with this possibility (and we don't), we'll basically have to ensure that the object is not deallocated while lisp is still thinking of it as a first-class object. There's some support for this in the case of objects created with MAKE-INSTANCE, but we may need to give similar treatment to foreign objects that are introduced to the lisp runtime in other ways (as function arguments, return values, SLOT-VALUE results, etc. as well as those instances that are created under lisp control.)

This doesn't all work yet (in fact, not much of it works yet); in practice, this has not yet been as much of a problem as anticipated, but that may be because existing Cocoa code deals primarily with relatively long-lived objects such as windows, views, menus, etc.

Recommended Reading

Cocoa Documentation

This is the top page for all of Apple's documentation on Cocoa. If you are unfamiliar with Cocoa, it is a good place to start.

Foundation Reference for Objective-C

This is one of the two most important Cocoa references; it covers all of the basics, except for GUI programming. This is a reference, not a tutorial.

Glossary of Terms

acode

An intermediate representation of Lisp code produced by the compiler front-end. Short for “alphatized code,” meaning “has undergone alpha reduction,” which in turn means “all lambda-bound variables have been consistently renamed.”

dnode

Two native words (eight bytes on a 32-bit lisp, sixteen bytes on a 64-bit lisp). Lisp heap memory is allocated in units of dnodes.

gvector

A uvector whose elements are nodes.

immediate

A value in memory or a register that is a raw, untagged value.

ivector

A uvector whose elements are immediates.

LAP

An old acronym for Lisp Assembly Program. A notation for writing assembly language in Lisp-like syntax. Not surprisingly, the exact LAP syntax varies, depending on processor architecture.

node

A value in memory or a register that is a tagged lisp datum.

punt, puntable

If a variable is bound to a simple expression and is never setq’d, then all references to the variable can be replaced by references to the simple expression in question. This process is called “punting.”
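For example (schematically, not actual compiler output): given

(let ((x n))
  (+ x 1))

where x is bound to the simple expression n and never setq'd, the reference to x can be replaced by n, giving the equivalent of (+ n 1); the binding of x can then be discarded.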

tagged return address

On x86, the call instruction unconditionally pushes a return address onto the stack. The compiler aligns the call instruction (via insertion of nop instructions) such that the pushed return address will have a special tag that the GC can recognize.

uuo

An “unimplemented user operation.” Sometimes called a trap (after the PowerPC instructions). An illegal instruction used as a way for lisp code to request service from the lisp kernel (such as allocate more memory, initiate a GC, and so forth).

uvector

A memory-allocated object with a header word that describes the object’s type and the number of elements it contains. (The name comes from Spice Lisp.)

value stack

A stack that contains tagged lisp objects (nodes) such as function arguments, local variables, and other stack-allocated lisp objects. The contents of the value stack between its bottom and top are always unambiguously nodes.

vinsn

Virtual Instruction. A pre-assembled fragment of code used as a code generation template by the compiler backend.

Vinsns are written in a LAP-like notation, but a key difference is that vinsns are expanded/assembled as much as possible at vinsn definition time. When a vinsn is emitted at compile time, the operands in the vinsn template are then filled in.

Symbol Index