Clozure CL Documentation

17.2.1. Overview

Regardless of other details of its implementation, a garbage collector's job is to partition the set of all heap-allocated lisp objects (CONSes, STRINGs, INSTANCEs, etc.) into two subsets. The first subset contains all objects that are transitively referenced from a small set of "root" objects (the contents of the stacks and registers of all active threads at the time the GC occurs and the values of some global variables.) The second subset contains everything else: those lisp objects that are not transitively reachable from the roots are garbage, and the memory occupied by garbage objects can be reclaimed (since the GC has just proven that it's impossible to reference them.)

The set of live, reachable lisp objects basically form the nodes of a (usually large) graph, with edges from each node A to any other objects (nodes) that object A references.

Some nodes in this graph can never have outgoing edges: an array with a specialized numeric or character type usually represents its elements in some (possibly more compact) specialized way. Some nodes may refer to lisp objects that are never allocated in memory (FIXNUMs, CHARACTERs, SINGLE-FLOATs on 64-bit platforms ..) This latter class of objects are sometimes called "immediates", but that's a little confusing because the term "immediate" is sometimes used to refer to things that can never be part of the big connectivity graph (e.g., the "raw" bits that make up a floating-point value, foreign address, or numeric value that needs to be used - at least fleetingly - in compiled code.)

For the GC to be able to build the connectivity graph reliably, it's necessary for it to be able to reliably tell (a) whether or not a "potential root" - the contents of a machine register or stack location - is in fact a node and (b) for any node, whether it may have components that refer to other nodes.

There's no reliable way to answer the first question on stock hardware. (If everything was a node, as might be the case on specially microcoded "lisp machine" hardware, it wouldn't even need to be asked.) Since there's no way to just look at a machine word (the contents of a machine register or stack location) and tell whether or not it's a node or just some random non-node value, we have to either adopt and enforce strict conventions on register and stack usage or tolerate ambiguity.

"Tolerating ambiguity" is an approach taken by some ("conservative") GC schemes; by contrast, Clozure CL's GC is "precise", which in this case means that it believes that the contents of certain machine registers and stack locations are always nodes and that other registers and stack locations are never nodes and that these conventions are never violated by the compiler or runtime system. The fact that threads are preemptively scheduled means that a GC could occur (because of activity in some other thread) on any instruction boundary, which in turn means that the compiler and runtime system must follow precise Section 17.2.3, “Register and stack usage conventions” at all times.

Once we've decided that a given machine word is a node, a Section 17.2.4, “Tagging scheme” describes how the node's value and type are encoded in that machine word.

Most of this discussion—so far—has treated things from the GC's very low-level perspective. From a much higher point of view, lisp functions accept nodes as arguments, return nodes as values, and (usually) perform some operations on those arguments in order to produce those results. (In many cases, the operations in question involve raw non-node values.) Higher-level parts of the lisp type system (functions like TYPE-OF and CLASS-OF, etc.) depend on the Section 17.2.4, “Tagging scheme”.

17.2.2. pc-locatives on the PPC

On the PPC, there's a third case (besides "node" and "immediate" values). As discussed below, a node that denotes a memory-allocated lisp object is a biased (tagged) pointer -to- that object; it's not generally possible to point -into- some composite (multi-element) object (such a pointer would not be a node, and the GC would have no way to update the pointer if it were to move the underlying object.)

Such a pointer ("into" the interior of a heap-allocated object) is often called a locative; the cases where locatives are allowed in Clozure CL mostly involve the behavior of function call and return instructions. (To be technically accurate, the other case also arises on x86-64, but that case isn't as user-visible.)

On the PowerPC (both PPC32 and PPC64), all machine instructions are 32 bits wide and all instruction words are allocated on 32-bit boundaries. In PPC Clozure CL, a CODE-VECTOR is a specialized type of vector-like object; its elements are 32-bit PPC machine instructions. A CODE-VECTOR is an attribute of a FUNCTION object; a function call involves accessing the function's code-vector and jumping to the address of its first instruction.

As each instruction in the code vector sequentially executes, the hardware program counter (PC) register advances to the address of the next instruction (a locative into the code vector); since PPC instructions are always 32 bits wide and aligned on 32-bit boundaries, the low two bits of the PC are always 0. If the function executes a call (simple call instructions have the mnemonic "bl" on the PPC, which stands for "branch and link"), the address of the next instruction (also a word-aligned locative into a code-vector) is copied into the special- purpose PPC "link register" (lr); a function returns to its caller via a "branch to link register" (blr) instruction. Some cases of function call and return might also use the PPC's "count register" (ctr), and if either the lr or ctr needs to be stored in memory it needs to first be copied to a general-purpose register.

Clozure CL's GC understands that certain registers contain these special "pc-locatives" (locatives that point into CODE-VECTOR objects); it contains special support for finding the containing CODE-VECTOR object and for adjusting all of these "pc-locatives" if the containing object is moved in memory. The first part of that operation—finding the containing object—is possible and practical on the PPC because of architectural artifacts (fixed-width instructions and arcana of instruction encoding.) It's not possible on x86-64, but fortunately not necessary either (though the second part - adjusting the PC/RIP when the containing object moves) is both necessary and simple.

17.2.3. Register and stack usage conventions

17.2.3.1. Stack conventions

On both PPC and X86 platforms, each lisp thread uses 3 stacks; the ways in which these stacks are used differs between the PPC and X86.

Each thread has:

A "control stack". On both platforms, this is "the stack" used by foreign code. On the PPC, it consists of a linked list of frames where the first word in each frame points to the first word in the previous frame (and the outermost frame points to 0.) Some frames on a PPC control stack are lisp frames; lisp frames are always 4 words in size and contain (in addition to the back pointer to the previous frame) the calling function (a node), the return address (a "locative" into the calling function's code-vector), and the value to which the value-stack pointer (see below) should be restored on function exit. On the PPC, the GC has to look at control-stack frames, identify which of those frames are lisp frames, and treat the contents of the saved function slot as a node (and handle the return address locative specially.) On x86-64, the control stack is used for dynamic-extent allocation of immediate objects. Since the control stack never contains nodes on x86-64, the GC ignores it on that platform. Alignment of the control stack follows the ABI conventions of the platform (at least at any point in time where foreign code could run.) On PPC, the r1 register always points to the top of the current thread's control stack; on x86-64, the RSP register points to the top of the current thread's control stack when the thread is running foreign code and the address of the top of the control stack is kept in the thread's TCR (see Section 17.1.1, “The Thread Context Record” when not running foreign code. The control stack "grows down."
A "value stack". On both platforms, all values on the value stack are nodes (including "tagged return addresses" on x86-64.) The value stack is always aligned to the native word size; objects are always pushed on the value stack using atomic instructions ("stwu"/"stdu" on PPC, "push" on x86-64), so the contents of the value stack between its bottom and top are always unambiguously nodes; the compiler usually tries to pop or discard nodes from the value stack as soon as possible after their last use (as soon as they may have become garbage.) On x86-64, the RSP register addresses the top of the value stack when running lisp code; that address is saved in the TCR when running foreign code. On the PPC, a dedicated register (VSP, currently r15) is used to address the top of the value stack when running lisp code, and the VSP value is saved in the TCR when running foreign code. The value stack grows down.
A "temp stack". The temp stack consists of a linked list of frames, each of which points to the previous temp stack frame. The number of native machine words in each temp stack frame is always even, so the temp stack is aligned on a two-word (64- or 128-bit) boundary. The temp stack is used for dynamic-extent objects on both platforms; on the PPC, it's used for essentially all such objects (regardless of whether or not the objects contain nodes); on the x86-64, immediate dynamic-extent objects (strings, foreign pointers, etc.) are allocated on the control stack and only node-containing dynamic-extent objects are allocated on the temp stack. Data structures used to implement CATCH and UNWIND-PROTECT are stored on the temp stack on both ppc and x86-64. Temp stack frames are always doublenode aligned and objects within a temp stack frame are aligned on doublenode boundaries. The first word in each frame contains a back pointer to the previous frame; on the PPC, the second word is used to indicate to the GC whether the remaining objects are nodes (if the second word is 0) or immediate (otherwise.) On x86-64, where temp stack frames always contain nodes, the second word is always 0. The temp stack grows down. It usually takes several instructions to allocate and safely initialize a temp stack frame that's intended to contain nodes, and the GC has to recognize the case where a thread is in the process of allocating and initializing a temp stack frame and take care not to interpret any uninitialized words in the frame as nodes. The PPC keeps the current top of the temp stack in a dedicated register (TSP, currently r12) when running lisp code and saves this register's value in the TCR when running foreign code. The x86-64 keeps the address of the top of each thread's temp stack in the thread's TCR.

17.2.3.2. Register conventions

If there are a "reasonable" (for some value of "reasonable") number of general-purpose registers and the instruction set is "reasonably" orthogonal (most instructions that operate on GPRs can operate on any GPR), then it's possible to statically partition the GPRs into at least two sets: "immediate registers" never contain nodes, and "node registers" always contain nodes. (On the PPC, a few registers are members of a third set of "PC locatives", and on both platforms some registers may have dedicated roles as stack or heap pointers; the latter class is treated as immediates by the GC proper but may be used to help determine the bounds of stack and heap memory areas.)

The ultimate definition of register partitioning is hardwired into the GC in functions like "mark_xp()" and "forward_xp()", which process the values of some of the registers in an exception frame as nodes and may give some sort of special treatment to other register values they encounter there.)

On x86-64, the static register partitioning scheme involves:

(only) three "immediate" registers.

The RAX, RCX, and RDX registers are used as the implicit operands and results of some extended-precision multiply and divide instructions which generally involve non-node values; since their use in these instructions means that they can't be guaranteed to contain node values at all times, it's natural to put these registers in the "immediate" set. RAX is generally given the symbolic name "imm0", RDX is given the symbolic name "imm1" and RCX is given the symbolic name "imm2"; you may see these names in disassembled code, usually in operations involving type checking, array indexing, and foreign memory and function access.
(only) two "dedicated" registers.

RSP and RBP have dedicated functionality dictated by the hardware and calling conventions.
11 "node" registers.

All other registers (RBX, RSI, RDI, and R8-R15) are asserted to contain node values at (almost) all times; legacy "string" operations that implicitly use RSI and/or RDI are not used.

On 32-bit x86, the default register partitioning scheme involves:

A single "immediate" register.

The EAX register is given the symbolic name "imm0".
There are two "dedicated" registers.

ESP and EBP have dedicated functionality dictated by the hardware and calling conventions.
5 "node" registers.

The remaining registers, (EBX, ECX, EDX, ESI, EDI) normally contain node values. As on x86-64, string instructions that implicity use ESI and EDI are not used.

There are times when this default partitioning scheme is inadequate. As mentioned in the x86-64 section, there are instructions like the extended-precision MUL and DIV which require the use of EAX and EDX. We therefore need a way to change this partitioning at run-time.

Two schemes are employed. The first uses a mask in the TCR that contains a bit for each register. If the bit is set, the register is interpreted by the GC as a node register; if it's clear, the register is treated as an immediate register. The second scheme uses the direction flag in the EFLAGS register. If DF is set, EDX is treated as an immediate register. (We don't use the string instructions, so DF isn't otherwise used.)

On the PPC, the static register partitioning scheme involves:

6 "immediate" registers.

Registers r3-r8 are given the symbolic names imm0-imm5. As a RISC architecture with simpler addressing modes, the PPC probably uses immediate registers a bit more often than the CISC x86-64 does, but they're generally used for the same sort of things (type checking, array indexing, FFI, etc.)
9 dedicated registers
- r0 (symbolic name rzero) always contains the value 0 when running lisp code. Its value is sometimes read as 0 when it's used as the base register in a memory address; keeping the value 0 there is sometimes convenient and avoids asymmetry.
- r1 (symbolic name sp) is the control stack pointer, by PPC convention.
- r2 is used to hold the current thread's TCR on ppc64 systems; it's not used on ppc32.
- r9 and r10 (symbolic names allocptr and allocbase) are used to do per-thread memory allocation
- r11 (symbolic name nargs) contains the number of function arguments on entry and the number of return values in multiple-value returning constructs. It's not used more generally as either a node or immediate register because of the way that certain trap instruction encodings are interpreted.
- r12 (symbolic name tsp) holds the top of the current thread's temp stack.
- r13 is used to hold the TCR on PPC32 systems; it's not used on PPC64.
- r14 (symbolic name loc-pc) is used to copy "pc-locative" values between main memory and special-purpose PPC registers (LR and CTR) used in function-call and return instructions.
- r15 (symbolic name vsp) addresses the top of the current thread's value stack.
- lr and ctr are PPC branch-unit registers used in function call and return instructions; they're always treated as "pc-locatives", which precludes the use of the ctr in some PPC looping constructs.
17 "node" registers

r15-r31 are always treated as node registers

17.2.4. Tagging scheme

Clozure CL always allocates lisp objects on double-node (64-bit for 32-bit platforms, 128-bit for 64-bit platforms) boundaries; this mean that the low 3 bits (32-bit lisp) or 4 bits (64-bit lisp) are always 0 and are therefore redundant (we only really need to know the upper 29 or 60 bits in order to identify the aligned object address.) The extra bits in a lisp node can be used to encode at least some information about the node's type, and the other 29/60 bits represent either an immediate value or a doublenode-aligned memory address. The low 3 or 4 bits of a node are called the node's "tag bits", and the conventions used to encode type information in those tag bits are called a "tagging scheme."

It might be possible to use the same tagging scheme on all platforms (at least on all platforms with the same word size and/or the same number of available tag bits), but there are often some strong reasons for not doing so. These arguments tend to be very machine-specific: sometimes, there are fairly obvious machine-dependent tricks that can be exploited to make common operations on some types of tagged objects faster; other times, there are architectural restrictions that make it impractical to use certain tags for certain types. (On PPC64, the "ld" (load doubleword) and "std" (store doubleword) instructions - which load and store a GPR operand at the effective address formed by adding the value of another GPR operand and a 16-bit constant operand - require that the low two bits of that constant operand be 0. Since such instructions would typically be used to access the fields of things like CONS cells and structures, it's desirable that that the tags chosen for CONS cells and structures allow the use of these instructions as opposed to more expensive alternatives.)

One architecture-dependent tagging trick that works well on all architectures is to use a tag of 0 for FIXNUMs: a fixnum basically encodes its value shifted left a few bits and keeps those low bits clear. FIXNUM addition, subtraction, and binary logical operations can operate directly on the node operands, addition and subtraction can exploit hardware-based overflow detection, and (in the absence of overflow) the hardware result of those operations is a node (fixnum). Some other slightly-less-common operations may require a few extra instructions, but arithmetic operations on FIXNUMs should be as cheap as possible and using a tag of zero for FIXNUMs helps to ensure that it will be.

If we have N available tag bits (N = 3 for 32-bit Clozure CL and N = 4 for 64-bit Clozure CL), this way of representing fixnums with the low M bits forced to 0 works as long as M <= N. The smaller we make M, the larger the values of MOST-POSITIVE-FIXNUM and MOST-NEGATIVE become; the larger we make N, the more distinct non-FIXNUM tags become available. A reasonable compromise is to choose M = N-1; this basically yields two distinct FIXNUM tags (one for even fixnums, one for odd fixnums), gives 30-bit fixnums on 32-bit platforms and 61-bit fixnums on 64-bit platforms, and leaves us with 6 or 14 tags to encoded other types.

Once we get past the assignment of FIXNUM tags, things quickly devolve into machine-dependencies. We can fairly easily see that we can't directly tag all other primitive lisp object types with only 6 or 14 available tag values; the details of how types are encoded vary between the ppc32, ppc64, and x86-64 implementations, but there are some general common principles: