
Chapter 4. Using Clozure CL

4.5. Unicode

All characters and strings in Clozure CL fully support Unicode by using UTF-32. There is only one CHARACTER type and one STRING type in Clozure CL. There has been a lot of discussion about this decision which can be found by searching the openmcl-devel archives at http://clozure.com/pipermail/openmcl-devel/. Suffice it to say that we decided that the simplicity and speed advantages of only supporting UTF-32 outweigh the space disadvantage.

4.5.1. Characters

There is one CHARACTER type in Clozure CL. All CHARACTERs are BASE-CHARs. CHAR-CODE-LIMIT is now #x110000, which means that all Unicode characters can be directly represented. As of Unicode 5.0, only about 100,000 of 1,114,112 possible CHAR-CODEs are actually defined. The function CODE-CHAR knows that certain ranges of code values (notably the surrogate range #xd800-#xdfff) will never be valid character codes and will return NIL for arguments in that range, but may return a non-NIL value (an undefined/non-standard CHARACTER object) for other unassigned code values.
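A minimal sketch of these rules at the REPL. The helper name valid-code-p is hypothetical, and the NIL result for a surrogate code value is the Clozure CL behavior described above; other implementations may differ.

```lisp
;; Hypothetical helper: true when CODE-CHAR yields a character for CODE.
(defun valid-code-p (code)
  (and (<= 0 code (1- char-code-limit))
       (code-char code)
       t))

(valid-code-p #x395)   ; Greek capital letter epsilon, an assigned code
(valid-code-p #xd800)  ; a surrogate code value; NIL in Clozure CL per the text
```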

Clozure CL supports character names of the form u+xxxx, where xxxx is a sequence of one or more hex digits. The value of the hex digits denotes the code of the character. The + character is optional, so #\u+0020, #\U0020, and #\U+20 all refer to the #\Space character.

Characters with codes in the range #xa0-#x7ff also have symbolic names. These are the names from the Unicode standard with spaces replaced by underscores. So #\Greek_Capital_Letter_Epsilon can be used to refer to the character whose CHAR-CODE is #x395. To see the complete list of supported character names, look just below the definition for register-character-name in ccl:level-1;l1-reader.lisp.
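The naming conventions above can be checked with a short sketch. It relies on the Clozure CL reader, so it is not portable; the function name space-spellings-agree-p is hypothetical.

```lisp
;; Hypothetical helper: the three spellings from the text should all
;; read as the same character as #\Space.
(defun space-spellings-agree-p ()
  (and (char= #\u+0020 #\Space)
       (char= #\U0020 #\Space)
       (char= #\U+20 #\Space)))

(space-spellings-agree-p)

;; A symbolic Unicode name; the text gives this character's code as #x395.
(char-code #\Greek_Capital_Letter_Epsilon)
```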

4.5.2. External Formats

OPEN, LOAD, and COMPILE-FILE all take an :EXTERNAL-FORMAT keyword argument. The value of :EXTERNAL-FORMAT can be :DEFAULT (the default value), a line termination keyword (see Section 4.5.3, “Line Termination Keywords”), a character encoding keyword (see Section 4.5.4, “Character Encodings”), an external-format object created using CCL::MAKE-EXTERNAL-FORMAT (see make-external-format), or a plist with the keys :DOMAIN, :CHARACTER-ENCODING, and :LINE-TERMINATION. If the argument is a plist, the result of (APPLY #'MAKE-EXTERNAL-FORMAT argument) will be used.

If :DEFAULT is specified, then the value of CCL:*DEFAULT-EXTERNAL-FORMAT* is used. If no line-termination is specified, then the value of CCL:*DEFAULT-LINE-TERMINATION* is used, which defaults to :UNIX. If no character encoding is specified, then CCL:*DEFAULT-FILE-CHARACTER-ENCODING* is used for file streams and CCL:*DEFAULT-SOCKET-CHARACTER-ENCODING* is used for socket streams. The initial value of both of these variables is NIL, which is a synonym for :ISO-8859-1.

Note that the set of keywords used to denote CHARACTER-ENCODINGs and the set of keywords used to denote line-termination conventions are disjoint: a keyword denotes either a character encoding or a line termination convention, but never both.

EXTERNAL-FORMATs are objects (structures) with two read-only fields that can be accessed via the functions: EXTERNAL-FORMAT-LINE-TERMINATION and EXTERNAL-FORMAT-CHARACTER-ENCODING.
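As an illustration, the sketch below passes a plist and then a bare character encoding keyword as :EXTERNAL-FORMAT. The file name and the function name write-and-read-utf-8 are hypothetical; only the keyword arguments come from the text above.

```lisp
;; Hypothetical scratch file in the current directory.
(defparameter *demo-path* "ccl-external-format-demo.txt")

(defun write-and-read-utf-8 (text)
  "Write TEXT as UTF-8 with Unix line endings, then read the first line back."
  ;; A plist external format, as described above.
  (with-open-file (out *demo-path*
                       :direction :output
                       :if-exists :supersede
                       :external-format '(:character-encoding :utf-8
                                          :line-termination :unix))
    (write-line text out))
  ;; A bare character encoding keyword; line termination falls back
  ;; to CCL:*DEFAULT-LINE-TERMINATION*.
  (with-open-file (in *demo-path* :external-format :utf-8)
    (read-line in)))

(write-and-read-utf-8 (format nil "~C is epsilon" (code-char #x395)))
```

Naming the encoding explicitly on both sides keeps the round trip independent of the current default variables.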

[Variable]

CCL:*DEFAULT-EXTERNAL-FORMAT*

Description:

The value of this variable is used when :EXTERNAL-FORMAT is unspecified or specified as :DEFAULT. It can meaningfully be given any value that can be used as an external-format (except for the value :DEFAULT.)

The initial value of this variable in Clozure CL is :UNIX, which is equivalent to (:LINE-TERMINATION :UNIX), among other things.

[Variable]

CCL:*DEFAULT-LINE-TERMINATION*

Description:

The value of this variable is used when an external-format doesn't specify a line-termination convention (or specifies it as :DEFAULT.) It can meaningfully be given any value that can be used as a line termination keyword (see Section 4.5.3, “Line Termination Keywords”).

The initial value of this variable in Clozure CL is :UNIX.

[Function]

make-external-format &key domain character-encoding line-termination => external-format
Either creates a new external format object, or returns an existing one with the same specified slot values.

Arguments and Values:

domain---This is used to indicate where the external format is to be used. Its value can be almost anything. It defaults to NIL. There are two domains that have a pre-defined meaning in Clozure CL: :FILE indicates encoding for a file in the file system and :SOCKET indicates i/o to/from a socket. The value of domain affects the default values for character-encoding and line-termination.

character-encoding---A keyword that specifies the character encoding for the external format (see Section 4.5.4, “Character Encodings”). Defaults to :DEFAULT, which means: if domain is :FILE, use the value of the variable CCL:*DEFAULT-FILE-CHARACTER-ENCODING*, and if domain is :SOCKET, use the value of the variable CCL:*DEFAULT-SOCKET-CHARACTER-ENCODING*. The initial value of both of these variables is NIL, which means the :ISO-8859-1 encoding.

line-termination---A line termination keyword (see Section 4.5.3, “Line Termination Keywords”). Defaults to :DEFAULT, which means use the value of the variable CCL:*DEFAULT-LINE-TERMINATION*.

external-format---An external-format object as described above.

Description:

Despite the function's name, it doesn't necessarily create a new, unique EXTERNAL-FORMAT object: two calls to MAKE-EXTERNAL-FORMAT with the same arguments made in the same dynamic environment return the same (eq) object.
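A small sketch of this identity property, together with the two accessors mentioned in Section 4.5.2. The ccl:: prefix follows the usage earlier in this section, and the function name same-format-twice-p is hypothetical.

```lisp
;; Hypothetical helper: two calls with the same arguments should
;; yield the same (EQ) object, per the description above.
(defun same-format-twice-p ()
  (eq (ccl::make-external-format :domain :file
                                 :character-encoding :utf-8
                                 :line-termination :unix)
      (ccl::make-external-format :domain :file
                                 :character-encoding :utf-8
                                 :line-termination :unix)))

(same-format-twice-p)

;; Reading the two fields of an external-format object.
(let ((ef (ccl::make-external-format :character-encoding :utf-8
                                     :line-termination :unix)))
  (list (ccl::external-format-character-encoding ef)
        (ccl::external-format-line-termination ef)))
```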

4.5.3. Line Termination Keywords

Line termination keywords indicate which characters are used to indicate the end of a line. On input, the external line termination characters are replaced by #\Newline and on output, #\Newlines are converted to the external line termination characters.

Table 4.1. Line Termination Keywords

keyword      character(s)
:UNIX        #\Linefeed
:MACOS       #\Return
:CR          #\Return
:CRLF        #\Return #\Linefeed
:CP/M        #\Return #\Linefeed
:MSDOS       #\Return #\Linefeed
:DOS         #\Return #\Linefeed
:WINDOWS     #\Return #\Linefeed
:INFERRED    see below
:UNICODE     #\Line_Separator

:INFERRED means that a stream's line-termination convention is determined by looking at the contents of a file. It is only useful for FILE-STREAMs that are open for :INPUT or :IO. The first buffer full of data is examined, and if a #\Return character occurs before any #\Linefeed character, then the line termination type is set to :WINDOWS if that #\Return character is immediately followed by a #\Linefeed character and to :MACOS otherwise. If a #\Return character isn't found in the buffer or if #\Return is preceded by #\Linefeed, the file's line termination type is set to :UNIX.
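The inference rule can be sketched in portable Common Lisp. This is an illustration of the rule as stated above, applied to a string buffer, and not the code Clozure CL itself uses; the function name infer-line-termination is hypothetical.

```lisp
;; Hypothetical helper that mirrors the :INFERRED rule on a string buffer.
(defun infer-line-termination (buffer)
  "Return :WINDOWS, :MACOS, or :UNIX for the string BUFFER."
  (let ((cr (position #\Return buffer))
        (lf (position #\Linefeed buffer)))
    (cond ((null cr) :unix)                    ; no #\Return at all
          ((and lf (< lf cr)) :unix)           ; a #\Linefeed comes first
          ((and (< (1+ cr) (length buffer))
                (char= (char buffer (1+ cr)) #\Linefeed))
           :windows)                           ; #\Return then #\Linefeed
          (t :macos))))                        ; bare #\Return

(infer-line-termination (coerce (list #\a #\Return #\Linefeed #\b) 'string))
```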

4.5.4. Character Encodings

Internally, all characters and strings in Clozure CL are in UTF-32. Externally, files or socket streams may encode characters in a wide variety of ways. The International Organization for Standardization, widely known as ISO, defines many of these character encodings. Clozure CL implements some of these encodings as detailed below. These encodings are part of the specification of external formats (see Section 4.5.2, “External Formats”). When reading from a stream, characters are converted from the specified external character encoding to UTF-32. When writing to a stream, characters are converted from UTF-32 to the specified character encoding.

Internally, CHARACTER-ENCODINGs are objects (structures) that are named by character encoding keywords (:ISO-8859-1, :UTF-8, etc.). The structures contain attributes of the encoding and functions used to encode/decode external data, but unless you're trying to define or debug an encoding there's little reason to know much about the CHARACTER-ENCODING objects and it's usually preferable to refer to a character encoding by its name.

4.5.4.1. Encoding Problems

On output to streams with character encodings that can encode the full range of Unicode, and on input from any stream, "unencodable characters" are represented using the Unicode #\Replacement_Character (= #\U+fffd). The presence of such a character usually indicates that something got lost in translation: either the data wasn't encoded properly or there was a bug in the decoding process.

4.5.4.2. Byte Order Marks

The endianness of a character encoding is sometimes explicit, and sometimes not. For example, :UTF-16BE indicates big-endian, but :UTF-16 does not specify endianness. A byte order mark is a special character that may appear at the beginning of a stream of encoded characters to specify the endianness of a multi-byte character encoding. (It may also be used with UTF-8 character encodings, where it is simply used to indicate that the encoding is UTF-8.)

Clozure CL writes a byte order mark as the first character of a file or socket stream when the endianness of the character encoding is not explicit. Clozure CL also expects a byte order mark on input from streams where the endianness is not explicit. If a byte order mark is missing from input data, that data is assumed to be in big-endian order.
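The input rule can be sketched for UTF-16 data held in an octet vector. This is an illustration in portable Common Lisp, not Clozure CL's own code, and the function name utf-16-byte-order is hypothetical. The character U+FEFF serializes as the octets FE FF in big-endian order and FF FE in little-endian order.

```lisp
;; Hypothetical helper: inspect the first two octets of UTF-16 data.
(defun utf-16-byte-order (octets)
  "Return the byte order of OCTETS and whether a byte order mark was found."
  (let ((b0 (and (> (length octets) 1) (aref octets 0)))
        (b1 (and (> (length octets) 1) (aref octets 1))))
    (cond ((and (eql b0 #xfe) (eql b1 #xff)) (values :big-endian t))
          ((and (eql b0 #xff) (eql b1 #xfe)) (values :little-endian t))
          ;; No mark: assume big-endian, as the text describes.
          (t (values :big-endian nil)))))

(utf-16-byte-order (vector #xff #xfe #x41 #x00))
```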

A byte order mark from a UTF-8 encoded input stream is not treated specially and just appears as a normal character from the input stream. It is probably a good idea to skip over this character.
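One way to skip that character is sketched below in portable Common Lisp. The function name skip-byte-order-mark is hypothetical, and it assumes a character input stream whose first character may be U+FEFF.

```lisp
;; Hypothetical helper: drop a leading U+FEFF, leave anything else alone.
(defun skip-byte-order-mark (stream)
  "Consume a leading U+FEFF from the character input STREAM, if present."
  (let ((ch (peek-char nil stream nil nil)))
    (when (and ch (= (char-code ch) #xfeff))
      (read-char stream))
    stream))

;; Usage on an in-memory stream that starts with a byte order mark.
(with-input-from-string (in (concatenate 'string
                                         (string (code-char #xfeff))
                                         "hello"))
  (skip-byte-order-mark in)
  (read-line in))
```

The same call could be made right after opening a file with :EXTERNAL-FORMAT :UTF-8 and before the first read.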

4.5.4.3. DESCRIBE-CHARACTER-ENCODINGS

The set of character encodings supported by Clozure CL can be retrieved by calling CCL:DESCRIBE-CHARACTER-ENCODINGS.

[Function]

describe-character-encodings
Writes descriptions of defined character encodings to *terminal-io*.

Description:

Writes descriptions of all defined character encodings to *terminal-io*. These descriptions include the names of the encoding's aliases and a doc string which briefly describes each encoding's properties and intended use.

4.5.4.4. Supported Character Encodings

The list of supported encodings is reproduced here. Most encodings have aliases, e.g. the encoding named :ISO-8859-1 can also be referred to by the names :LATIN1 and :IBM819, among others. Where possible, the keywordized name of an encoding is equivalent to the preferred MIME charset name (and the aliases are all registered IANA charset names.)

:ISO-8859-1

An 8-bit, fixed-width character encoding in which all character codes map to their Unicode equivalents. Intended to support most characters used in most Western European languages.

Clozure CL uses ISO-8859-1 encoding for *TERMINAL-IO* and for all streams whose EXTERNAL-FORMAT isn't explicitly specified. The default for *TERMINAL-IO* can be set via the -K command-line argument (see Section 2.5, “Command Line Options”).

ISO-8859-1 just covers the first 256 Unicode code points, where the first 128 code points are equivalent to US-ASCII. That should be pretty much equivalent to what earlier versions of Clozure CL did, which only supported 8-bit characters, but it may not be optimal for users working in a particular locale.

Aliases: :ISO_8859-1, :LATIN1, :L1, :IBM819, :CP819, :CSISOLATIN1

:ISO-8859-2

An 8-bit, fixed-width character encoding in which codes #x00-#x9f map to their Unicode equivalents and other codes map to other Unicode character values. Intended to provide most characters found in most languages used in Central/Eastern Europe.

Aliases: :ISO_8859-2, :LATIN2, :L2, :CSISOLATIN2

:ISO-8859-3

An 8-bit, fixed-width character encoding in which codes #x00-#x9f map to their Unicode equivalents and other codes map to other Unicode character values. Intended to provide most characters found in most languages used in Southern Europe.

Aliases: :ISO_8859-3, :LATIN3, :L3, :CSISOLATIN3

:ISO-8859-4

An 8-bit, fixed-width character encoding in which codes #x00-#x9f map to their Unicode equivalents and other codes map to other Unicode character values. Intended to provide most characters found in most languages used in Northern Europe.

Aliases: :ISO_8859-4, :LATIN4, :L4, :CSISOLATIN4

:ISO-8859-5

An 8-bit, fixed-width character encoding in which codes #x00-#x9f map to their Unicode equivalents and other codes map to other Unicode character values. Intended to provide most characters found in the Cyrillic alphabet.

Aliases: :ISO_8859-5, :CYRILLIC, :CSISOLATINCYRILLIC, :ISO-IR-144

:ISO-8859-6

An 8-bit, fixed-width character encoding in which codes #x00-#x9f map to their Unicode equivalents and other codes map to other Unicode character values. Intended to provide most characters found in the Arabic alphabet.

Aliases: :ISO_8859-6, :ARABIC, :CSISOLATINARABIC, :ISO-IR-127

:ISO-8859-7

An 8-bit, fixed-width character encoding in which codes #x00-#x9f map to their Unicode equivalents and other codes map to other Unicode character values. Intended to provide most characters found in the Greek alphabet.

Aliases: :ISO_8859-7, :GREEK, :GREEK8, :CSISOLATINGREEK, :ISO-IR-126, :ELOT_928, :ECMA-118

:ISO-8859-8

An 8-bit, fixed-width character encoding in which codes #x00-#x9f map to their Unicode equivalents and other codes map to other Unicode character values. Intended to provide most characters found in the Hebrew alphabet.

Aliases: :ISO_8859-8, :HEBREW, :CSISOLATINHEBREW, :ISO-IR-138

:ISO-8859-9

An 8-bit, fixed-width character encoding in which codes #x00-#xcf map to their Unicode equivalents and other codes map to other Unicode character values. Intended to provide most characters found in the Turkish alphabet.

Aliases: :ISO_8859-9, :LATIN5, :CSISOLATIN5, :ISO-IR-148

:ISO-8859-10

An 8-bit, fixed-width character encoding in which codes #x00-#x9f map to their Unicode equivalents and other codes map to other Unicode character values. Intended to provide most characters found in Nordic alphabets.

Aliases: :ISO_8859-10, :LATIN6, :CSISOLATIN6, :ISO-IR-157

:ISO-8859-11

An 8-bit, fixed-width character encoding in which codes #x00-#x9f map to their Unicode equivalents and other codes map to other Unicode character values. Intended to provide most characters found in the Thai alphabet.

:ISO-8859-13

An 8-bit, fixed-width character encoding in which codes #x00-#x9f map to their Unicode equivalents and other codes map to other Unicode character values. Intended to provide most characters found in Baltic alphabets.

:ISO-8859-14

An 8-bit, fixed-width character encoding in which codes #x00-#x9f map to their Unicode equivalents and other codes map to other Unicode character values. Intended to provide most characters found in Celtic languages.

Aliases: :ISO_8859-14, :ISO-IR-199, :LATIN8, :L8, :ISO-CELTIC

:ISO-8859-15

An 8-bit, fixed-width character encoding in which codes #x00-#x9f map to their Unicode equivalents and other codes map to other Unicode character values. Intended to provide most characters found in Western European languages (including the Euro sign and some other characters missing from ISO-8859-1).

Aliases: :ISO_8859-15, :LATIN9

:ISO-8859-16

An 8-bit, fixed-width character encoding in which codes #x00-#x9f map to their Unicode equivalents and other codes map to other Unicode character values. Intended to provide most characters found in Southeast European languages.

Alias: :ISO_8859-16

:MACINTOSH

An 8-bit, fixed-width character encoding in which codes #x00-#x7f map to their Unicode equivalents and other codes map to other Unicode character values. Traditionally used on Classic MacOS to encode characters used in western languages.

Aliases: :MACOS-ROMAN, :MACOSROMAN, :MAC-ROMAN, :MACROMAN

:UCS-2

A 16-bit, fixed-length encoding in which characters with CHAR-CODEs less than #x10000 can be encoded in a single 16-bit word. The endianness of the encoded data is indicated by the endianness of a byte-order-mark character (#\u+feff) prepended to the data; in the absence of such a character on input, the data is assumed to be in big-endian order.

:UCS-2BE

A 16-bit, fixed-length encoding in which characters with CHAR-CODEs less than #x10000 can be encoded in a single 16-bit big-endian word. The encoded data is implicitly big-endian; byte-order-mark characters are not interpreted on input or prepended to output.

:UCS-2LE

A 16-bit, fixed-length encoding in which characters with CHAR-CODEs less than #x10000 can be encoded in a single 16-bit little-endian word. The encoded data is implicitly little-endian; byte-order-mark characters are not interpreted on input or prepended to output.

:US-ASCII

A 7-bit, fixed-width character encoding in which all character codes map to their Unicode equivalents.

Aliases: :CSASCII, :CP637, :IBM637, :US, :ISO646-US, :ASCII, :ISO-IR-6

:UTF-16

A 16-bit, variable-length encoding in which characters with CHAR-CODEs less than #x10000 can be encoded in a single 16-bit word and characters with larger codes can be encoded in a pair of 16-bit words. The endianness of the encoded data is indicated by the endianness of a byte-order-mark character (#\u+feff) prepended to the data; in the absence of such a character on input, the data is assumed to be in big-endian order. Output is written in native byte order with a leading byte-order mark.

:UTF-16BE

A 16-bit, variable-length encoding in which characters with CHAR-CODEs less than #x10000 can be encoded in a single 16-bit big-endian word and characters with larger codes can be encoded in a pair of 16-bit big-endian words. The endianness of the encoded data is implicit in the encoding; byte-order-mark characters are not interpreted on input or prepended to output.

:UTF-16LE

A 16-bit, variable-length encoding in which characters with CHAR-CODEs less than #x10000 can be encoded in a single 16-bit little-endian word and characters with larger codes can be encoded in a pair of 16-bit little-endian words. The endianness of the encoded data is implicit in the encoding; byte-order-mark characters are not interpreted on input or prepended to output.
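For the UTF-16 family, the split between one and two 16-bit words can be sketched in portable Common Lisp. The arithmetic is the standard Unicode surrogate-pair rule rather than anything specific to Clozure CL, and the function name utf-16-code-units is hypothetical.

```lisp
;; Hypothetical helper: the 16-bit words that encode the character code CODE.
(defun utf-16-code-units (code)
  "Return a list of one or two 16-bit values for CODE."
  (if (< code #x10000)
      (list code)
      (let ((v (- code #x10000)))               ; 20 significant bits remain
        (list (logior #xd800 (ash v -10))       ; high surrogate: top 10 bits
              (logior #xdc00 (logand v #x3ff)))))) ; low surrogate: low 10 bits

(utf-16-code-units #x395)    ; fits in a single word
(utf-16-code-units #x1f600)  ; needs a pair of words
```

Byte order then decides how each 16-bit value is written as two octets.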

:UTF-32

A 32-bit, fixed-length encoding in which all Unicode characters can be encoded in a single 32-bit word. The endianness of the encoded data is indicated by the endianness of a byte-order-mark character (#\u+feff) prepended to the data; in the absence of such a character on input, input data is assumed to be in big-endian order. Output is written in native byte order with a leading byte-order mark.

Alias: :UCS-4

:UTF-32BE

A 32-bit, fixed-length encoding in which all Unicode characters can be encoded in a single 32-bit word. The encoded data is implicitly big-endian; byte-order-mark characters are not interpreted on input or prepended to output.

Alias: :UCS-4BE

:UTF-8

An 8-bit, variable-length character encoding in which characters with CHAR-CODEs in the range #x00-#x7f can be encoded in a single octet; characters with larger code values can be encoded in 2 to 4 bytes.
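The relationship between a character's code and its UTF-8 length can be sketched as follows. The boundaries are those of the standard UTF-8 definition, and the function name utf-8-octet-count is hypothetical.

```lisp
;; Hypothetical helper: number of octets UTF-8 uses for the code CODE.
(defun utf-8-octet-count (code)
  (cond ((< code #x80) 1)       ; #x00-#x7f: a single octet
        ((< code #x800) 2)
        ((< code #x10000) 3)
        (t 4)))                 ; up to #x10ffff

(mapcar #'utf-8-octet-count (list #x41 #x395 #x20ac #x1f600))
```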

:UTF-32LE

A 32-bit, fixed-length encoding in which all Unicode characters can be encoded in a single 32-bit word. The encoded data is implicitly little-endian; byte-order-mark characters are not interpreted on input or prepended to output.

Alias: :UCS-4LE

:Windows-31j

An 8-bit, variable-length character encoding in which character code points in the range #x00-#x7f can be encoded in a single octet; characters with larger code values can be encoded in 2 bytes.

Aliases: :CP932, :CSWINDOWS31J

:EUC-JP

An 8-bit, variable-length character encoding in which character code points in the range #x00-#x7f can be encoded in a single octet; characters with larger code values can be encoded in 2 bytes.

Alias: :EUCJP

:GB2312

An 8-bit, variable-length character encoding in which character code points in the range #x00-#x80 can be encoded in a single octet; characters with larger code values can be encoded in 2 bytes.

Aliases: :GB2312-80, :GB2312-1980, :EUC-CN, :EUCCN

:CP936

An 8-bit, variable-length character encoding in which character code points in the range #x00-#x80 can be encoded in a single octet; characters with larger code values can be encoded in 2 bytes.

Aliases: :GBK, :MS936, :WINDOWS-936

4.5.4.5. Encoding and Decoding Strings

Clozure CL provides functions to encode and decode strings to and from vectors of type (simple-array (unsigned-byte 8)).

[Function]

count-characters-in-octet-vector vector &key start end external-format

Description:

Returns the number of characters that would be produced by decoding vector (or the subsequence thereof delimited by start and end) according to external-format.

[Function]

decode-string-from-octets vector &key start end external-format string

Description:

Decodes the octets in vector (or the subsequence of it delimited by start and end) into a string according to external-format.

If string is supplied, output will be written into it. It must be large enough to hold the decoded characters. If string is not supplied, a new string will be allocated to hold the decoded characters.

Returns, as multiple values, the decoded string and the position in vector where the decoding ended.

Sequences of octets in vector that cannot be decoded into characters according to external-format will be decoded as #\Replacement_Character.

[Function]

encode-string-to-octets string &key start end external-format use-byte-order-mark vector vector-offset

Description:

Encodes string (or the substring delimited by start and end) into external-format and returns, as multiple values, a vector of octets containing the encoded data and an integer that specifies the offset into the vector where the encoded data ends.

When use-byte-order-mark is true, a byte-order mark will be included in the encoded data.

If vector is supplied, output will be written to it. It must be of type (simple-array (unsigned-byte 8)) and be large enough to hold the encoded data. If it is not supplied, the function will allocate a new vector.

If vector-offset is supplied, data will be written into the output vector starting at that offset.

Characters in string that cannot be encoded into external-format will be replaced with an encoding-dependent replacement character (#\Replacement_Character or #\Sub) before being encoded and written into the output vector.
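A minimal round-trip sketch using the two functions described above, assuming the :UTF-8 encoding. The function name utf-8-round-trip is hypothetical, and the ccl:: prefix is used so the sketch does not depend on whether the symbols are exported.

```lisp
;; Hypothetical helper: encode STRING to UTF-8 octets and decode it again.
(defun utf-8-round-trip (string)
  "Return the decoded string and the intermediate octet vector."
  (let ((octets (ccl::encode-string-to-octets string
                                              :external-format :utf-8
                                              :use-byte-order-mark nil)))
    ;; VALUES keeps only the primary value of the decode call (the string).
    (values (ccl::decode-string-from-octets octets :external-format :utf-8)
            octets)))

(utf-8-round-trip (format nil "~C is epsilon" (code-char #x395)))
```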

[Function]

string-size-in-octets string &key start end external-format use-byte-order-mark

Description:

Returns the number of octets required to encode string (or the substring delimited by start and end) into external-format.

When use-byte-order-mark is true, the returned size will include space for a byte-order mark.
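One possible use of the size function is to allocate the output vector before encoding. The sketch below does that and then counts the characters in the result; the function name encode-into-fresh-vector is hypothetical, and the ccl:: prefix is used so the sketch does not depend on whether the symbols are exported.

```lisp
;; Hypothetical helper: size the vector first, then encode into it.
(defun encode-into-fresh-vector (string external-format)
  "Return the encoded octets and the number of characters they decode to."
  (let* ((size (ccl::string-size-in-octets string
                                           :external-format external-format))
         (vector (make-array size :element-type '(unsigned-byte 8))))
    (ccl::encode-string-to-octets string
                                  :external-format external-format
                                  :vector vector)
    (values vector
            (ccl::count-characters-in-octet-vector
             vector :external-format external-format))))

(encode-into-fresh-vector "abc" :utf-8)
```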

