All characters and strings in Clozure CL fully support Unicode by using UTF-32. There is only one CHARACTER type and one STRING type in Clozure CL. There has been a lot of discussion about this decision, which can be found by searching the openmcl-devel archives at http://clozure.com/pipermail/openmcl-devel/. Suffice it to say that we decided that the simplicity and speed advantages of only supporting UTF-32 outweigh the space disadvantage.
There is one CHARACTER type in Clozure CL. All CHARACTERs are BASE-CHARs. CHAR-CODE-LIMIT is now #x110000, which means that all Unicode characters can be directly represented. As of Unicode 5.0, only about 100,000 of 1,114,112 possible CHAR-CODEs are actually defined. The function CODE-CHAR knows that certain ranges of code values (notably the surrogate range #xd800-#xdfff) will never be valid character codes and will return NIL for arguments in that range, but may return a non-NIL value (an undefined/non-standard CHARACTER object) for other unassigned code values.
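The claims above can be checked at the REPL. This is a minimal sketch using only standard Common Lisp functions; the values in the comments are what the text above leads us to expect from Clozure CL, not output guaranteed for other implementations.

```lisp
;; Evaluate in Clozure CL; other implementations may differ.
(list char-code-limit                 ; #x110000 (1114112)
      (code-char #xd800)              ; NIL: surrogate codes are never valid
      (char-code (code-char #x395)))  ; 917, i.e. #x395 round-trips
```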
Clozure CL supports character names of the form u+xxxx, where xxxx is a sequence of one or more hex digits. The value of the hex digits denotes the code of the character. The + character is optional, so #\u+0020, #\U0020, and #\U+20 all refer to the #\Space character.
Characters with codes in the range #xa0-#x7ff also have symbolic names. These are the names from the Unicode standard with spaces replaced by underscores. So #\Greek_Capital_Letter_Epsilon can be used to refer to the character whose CHAR-CODE is #x395. To see the complete list of supported character names, look just below the definition for register-character-name in ccl:level-1;l1-reader.lisp.
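As a short illustration of the character name syntax described above, the following forms can be read and evaluated in Clozure CL. The u+xxxx spellings and the underscore names are specific to the Clozure CL reader, so this sketch is not expected to load in other Lisps.

```lisp
;; The three u+xxxx spellings and the standard name denote one character.
(char= #\u+0020 #\U0020 #\U+20 #\Space)       ; true

;; A Unicode name with spaces replaced by underscores.
(char-code #\Greek_Capital_Letter_Epsilon)    ; 917, i.e. #x395
```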
OPEN, LOAD, and COMPILE-FILE all take an :EXTERNAL-FORMAT keyword argument. The value of :EXTERNAL-FORMAT can be :DEFAULT (the default value), a line termination keyword (see Section 4.5.3, “Line Termination Keywords”), a character encoding keyword (see Section 4.5.4, “Character Encodings”), an external-format object created using CCL::MAKE-EXTERNAL-FORMAT (see make-external-format), or a plist with keys :DOMAIN, :CHARACTER-ENCODING and :LINE-TERMINATION. If argument is a plist, the result of (APPLY #'MAKE-EXTERNAL-FORMAT argument) will be used.
If :DEFAULT is specified, then the value of CCL:*DEFAULT-EXTERNAL-FORMAT* is used. If no line-termination is specified, then the value of CCL:*DEFAULT-LINE-TERMINATION* is used, which defaults to :UNIX. If no character encoding is specified, then CCL:*DEFAULT-FILE-CHARACTER-ENCODING* is used for file streams and CCL:*DEFAULT-SOCKET-CHARACTER-ENCODING* is used for socket streams. The initial value of both of these character encoding variables is NIL, which is a synonym for :ISO-8859-1.
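The following sketch shows two of the accepted forms of :EXTERNAL-FORMAT in use with OPEN (via WITH-OPEN-FILE): a character encoding keyword on output and a plist on input. The function name UTF-8-ROUND-TRIP and the file path in the usage line are hypothetical; any writable path should do.

```lisp
;; Hypothetical helper: write TEXT to PATH as UTF-8, then read it back.
(defun utf-8-round-trip (path text)
  ;; A character encoding keyword used directly as the external format.
  (with-open-file (out path :direction :output
                            :if-exists :supersede
                            :external-format :utf-8)
    (write-line text out))
  ;; A plist; per the text above it is passed to MAKE-EXTERNAL-FORMAT.
  (with-open-file (in path :external-format '(:character-encoding :utf-8
                                             :line-termination :unix))
    (read-line in)))

;; Usage (the path is an assumption):
;; (utf-8-round-trip "/tmp/ccl-external-format-demo.txt"
;;                   (format nil "caf~c" (code-char #xe9)))
```

If both streams agree on the encoding, the string read back should be equal to the string written.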
Note that the set of keywords used to denote CHARACTER-ENCODINGs and the set of keywords used to denote line-termination conventions are disjoint: a keyword denotes at most a character encoding or a line termination convention, but never both.
EXTERNAL-FORMATs are objects (structures) with two read-only fields that can be accessed via the functions: EXTERNAL-FORMAT-LINE-TERMINATION and EXTERNAL-FORMAT-CHARACTER-ENCODING.
The value of CCL:*DEFAULT-EXTERNAL-FORMAT* is used when :EXTERNAL-FORMAT is unspecified or specified as :DEFAULT. It can meaningfully be given any value that can be used as an external-format (except for the value :DEFAULT).
The initial value of this variable in Clozure CL is :UNIX, which is equivalent to (:LINE-TERMINATION :UNIX), among other things.
The value of CCL:*DEFAULT-LINE-TERMINATION* is used when an external-format doesn't specify a line-termination convention (or specifies it as :DEFAULT). It can meaningfully be given any value that can be used as a line termination keyword (see Section 4.5.3, “Line Termination Keywords”).
The initial value of this variable in Clozure CL is :UNIX.
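Because these defaults are ordinary special variables, they can be rebound around a body of code instead of being set globally. The sketch below is one way to do that; the function name CALL-WITH-DEFAULT-EXTERNAL-FORMAT is hypothetical, and the double-colon CCL:: prefix is used only so that the sketch does not depend on whether the symbol is exported.

```lisp
;; Hypothetical helper: call THUNK with the default external format
;; dynamically rebound; the previous value is restored on exit.
(defun call-with-default-external-format (external-format thunk)
  (let ((ccl::*default-external-format* external-format))
    (funcall thunk)))

;; Usage: streams opened inside the thunk without an explicit
;; :EXTERNAL-FORMAT should pick up :UTF-8.
;; (call-with-default-external-format
;;  :utf-8
;;  (lambda () (with-open-file (in "some-file.txt") (read-line in))))
```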
make-external-format &key domain character-encoding line-termination
domain---This is used to indicate where the external format is to be used. Its value can be almost anything. It defaults to NIL. There are two domains that have a pre-defined meaning in Clozure CL: :FILE indicates encoding for a file in the file system and :SOCKET indicates i/o to/from a socket. The value of domain affects the default values for character-encoding and line-termination.
character-encoding---A keyword that specifies the character encoding for the external format (see Section 4.5.4, “Character Encodings”). Defaults to :DEFAULT, which means that if domain is :FILE, the value of the variable CCL:*DEFAULT-FILE-CHARACTER-ENCODING* is used, and if domain is :SOCKET, the value of the variable CCL:*DEFAULT-SOCKET-CHARACTER-ENCODING* is used. The initial value of both of these variables is NIL, which means the :ISO-8859-1 encoding.
line-termination---A line termination keyword (see Section 4.5.3, “Line Termination Keywords”). Defaults to :DEFAULT, which means that the value of the variable CCL:*DEFAULT-LINE-TERMINATION* is used.
external-format---An external-format object as described above.
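A brief sketch of building an external-format object and reading its two fields back with the accessors named earlier. The variable name *UTF-8-UNIX* is hypothetical, and CCL:: is used so the example does not depend on which symbols are exported.

```lisp
;; Hypothetical variable holding an explicitly specified external format.
(defparameter *utf-8-unix*
  (ccl::make-external-format :domain :file
                             :character-encoding :utf-8
                             :line-termination :unix))

(list (ccl::external-format-character-encoding *utf-8-unix*)  ; :UTF-8
      (ccl::external-format-line-termination *utf-8-unix*))   ; :UNIX
```

The resulting object can be passed as the :EXTERNAL-FORMAT argument wherever a keyword or plist would be accepted.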
Line termination keywords indicate which characters are used to indicate the end of a line. On input, the external line termination characters are replaced by #\Newline and on output, #\Newlines are converted to the external line termination characters.
Table 4.1. Line Termination Keywords
keyword | character(s) |
---|---|
:UNIX | #\Linefeed |
:MACOS | #\Return |
:CR | #\Return |
:CRLF | #\Return #\Linefeed |
:CP/M | #\Return #\Linefeed |
:MSDOS | #\Return #\Linefeed |
:DOS | #\Return #\Linefeed |
:WINDOWS | #\Return #\Linefeed |
:INFERRED | see below |
:UNICODE | #\Line_Separator |
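To see the output-side conversion from the table in action, the sketch below writes one character and a #\Newline through a :CRLF stream and then reads the file back as raw octets. The function name CRLF-OCTETS and the path in the usage line are hypothetical; the character encoding is left at its default.

```lisp
;; Hypothetical helper: write "a" and a newline to PATH using CRLF line
;; termination, then return the file's contents as a list of octets.
(defun crlf-octets (path)
  (with-open-file (out path :direction :output
                            :if-exists :supersede
                            :external-format '(:line-termination :crlf))
    (write-char #\a out)
    (write-char #\Newline out))
  ;; Reading with an octet element type bypasses all character conversion.
  (with-open-file (in path :element-type '(unsigned-byte 8))
    (loop for octet = (read-byte in nil)
          while octet
          collect octet)))

;; Usage (the path is an assumption):
;; (crlf-octets "/tmp/ccl-crlf-demo.txt")   ; (97 13 10)
```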
:INFERRED means that a stream's line-termination convention is determined by looking at the contents of a file. It is only useful for FILE-STREAMs that are open for :INPUT or :IO. The first buffer full of data is examined, and if a #\Return character occurs before any #\Linefeed character, then the line termination type is set to :WINDOWS if that #\Return character is immediately followed by a #\Linefeed character and to :MACOS otherwise. If a #\Return character isn't found in the buffer or if #\Return is preceded by #\Linefeed, the file's line termination type is set to :UNIX.
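The inference rule can be restated as a small function over a string that stands in for the first buffer of data. This is only a sketch of the rule as described in the paragraph above, not the code Clozure CL actually runs; the name INFER-LINE-TERMINATION is hypothetical.

```lisp
;; Hypothetical restatement of the :INFERRED rule on a string BUFFER.
(defun infer-line-termination (buffer)
  (let ((cr (position #\Return buffer))     ; first #\Return, if any
        (lf (position #\Linefeed buffer)))  ; first #\Linefeed, if any
    (cond
      ;; No #\Return at all, or a #\Linefeed comes first.
      ((or (null cr) (and lf (< lf cr))) :unix)
      ;; The first #\Return is immediately followed by #\Linefeed.
      ((and (< (1+ cr) (length buffer))
            (char= (char buffer (1+ cr)) #\Linefeed))
       :windows)
      ;; A #\Return not followed by #\Linefeed.
      (t :macos))))

;; Usage:
;; (infer-line-termination (coerce (list #\a #\Return #\Linefeed) 'string))
;; returns :WINDOWS
```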
Internally, all characters and strings in Clozure CL are in UTF-32. Externally, files or socket streams may encode characters in a wide variety of ways. The International Organization for Standardization, widely known as ISO, defines many of these character encodings. Clozure CL implements some of these encodings as detailed below. These encodings are part of the specification of external formats (see Section 4.5.2, “External Formats”). When reading from a stream, characters are converted from the specified external character encoding to UTF-32. When writing to a stream, characters are converted from UTF-32 to the specified character encoding.
Internally, CHARACTER-ENCODINGs are objects (structures) that are named by character encoding keywords (:ISO-8859-1, :UTF-8, etc.). The structures contain attributes of the encoding and functions used to encode/decode external data, but unless you're trying to define or debug an encoding there's little reason to know much about the CHARACTER-ENCODING objects and it's usually preferable to refer to a character encoding by its name.
On output to streams with character encodings that can encode the full range of Unicode—and on input from any stream—"unencodable characters" are represented using the Unicode #\Replacement_Character (= #\U+fffd); the presence of such a character usually indicates that something got lost in translation. Either data wasn't encoded properly or there was a bug in the decoding process.
The endianness of a character encoding is sometimes explicit, and sometimes not. For example, :UTF-16BE indicates big-endian, but :UTF-16 does not specify endianness. A byte order mark is a special character that may appear at the beginning of a stream of encoded characters to specify the endianness of a multi-byte character encoding. (It may also be used with UTF-8 character encodings, where it is simply used to indicate that the encoding is UTF-8.)
Clozure CL writes a byte order mark as the first character of a file or socket stream when the endianness of the character encoding is not explicit. Clozure CL also expects a byte order mark on input from streams where the endianness is not explicit. If a byte order mark is missing from input data, that data is assumed to be in big-endian order.
A byte order mark from a UTF-8 encoded input stream is not treated specially and just appears as a normal character from the input stream. It is probably a good idea to skip over this character.
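Since a byte order mark in UTF-8 input arrives as an ordinary character, skipping it is left to the caller. One possible way to do so is sketched below using only standard stream functions; the function name SKIP-BYTE-ORDER-MARK is hypothetical.

```lisp
;; Hypothetical helper: if the next character on STREAM is the byte
;; order mark (code #xfeff), consume it and return T; otherwise leave
;; the stream untouched and return NIL.
(defun skip-byte-order-mark (stream)
  (let ((next (peek-char nil stream nil nil)))
    (when (and next (char= next (code-char #xfeff)))
      (read-char stream)
      t)))

;; Usage (the file name is an assumption):
;; (with-open-file (in "utf8-with-bom.txt" :external-format :utf-8)
;;   (skip-byte-order-mark in)
;;   (read-line in))
```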
The set of character encodings supported by Clozure CL can be retrieved by calling CCL:DESCRIBE-CHARACTER-ENCODINGS.
The list of supported encodings is reproduced here. Most encodings have aliases, e.g. the encoding named :ISO-8859-1 can also be referred to by the names :LATIN1 and :IBM819, among others. Where possible, the keywordized name of an encoding is equivalent to the preferred MIME charset name (and the aliases are all registered IANA charset names).
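As a rough check that an alias and a primary name select the same encoding, the sketch below encodes one string under both names with ENCODE-STRING-TO-OCTETS (described later in this section) and compares the octets. The function name SAME-OCTETS-P is hypothetical, and identical output for one string is only suggestive, not proof that the two names are aliases.

```lisp
;; Hypothetical helper: true when STRING encodes to the same octets
;; under the encodings named NAME-A and NAME-B.
(defun same-octets-p (string name-a name-b)
  (equalp (ccl::encode-string-to-octets string :external-format name-a)
          (ccl::encode-string-to-octets string :external-format name-b)))

;; Usage:
;; (same-octets-p "abc" :iso-8859-1 :latin1)
```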
:ISO-8859-1
An 8-bit, fixed-width character encoding in which all character codes map to their Unicode equivalents. Intended to support most characters used in most Western European languages.
Clozure CL uses ISO-8859-1 encoding for *TERMINAL-IO* and for all streams whose EXTERNAL-FORMAT isn't explicitly specified. The default for *TERMINAL-IO* can be set via the -K command-line argument (see Section 2.5, “Command Line Options”).
ISO-8859-1 just covers the first 256 Unicode code points, where the first 128 code points are equivalent to US-ASCII. That should be pretty much equivalent to what earlier versions of Clozure CL did that only supported 8-bit characters, but it may not be optimal for users working in a particular locale.
Aliases: :ISO_8859-1, :LATIN1, :L1,
:IBM819, :CP819, :CSISOLATIN1
:ISO-8859-2
An 8-bit, fixed-width character encoding in which codes #x00-#x9f map to their Unicode equivalents and other codes map to other Unicode character values. Intended to provide most characters found in most languages used in Central/Eastern Europe.
Aliases: :ISO_8859-2, :LATIN2, :L2,
:CSISOLATIN2
:ISO-8859-3
An 8-bit, fixed-width character encoding in which codes #x00-#x9f map to their Unicode equivalents and other codes map to other Unicode character values. Intended to provide most characters found in most languages used in Southern Europe.
Aliases: :ISO_8859-3, :LATIN3, :L3,
:CSISOLATIN3
:ISO-8859-4
An 8-bit, fixed-width character encoding in which codes #x00-#x9f map to their Unicode equivalents and other codes map to other Unicode character values. Intended to provide most characters found in most languages used in Northern Europe.
Aliases: :ISO_8859-4, :LATIN4, :L4, :CSISOLATIN4
:ISO-8859-5
An 8-bit, fixed-width character encoding in which codes #x00-#x9f map to their Unicode equivalents and other codes map to other Unicode character values. Intended to provide most characters found in the Cyrillic alphabet.
Aliases: :ISO_8859-5, :CYRILLIC, :CSISOLATINCYRILLIC,
:ISO-IR-144
:ISO-8859-6
An 8-bit, fixed-width character encoding in which codes #x00-#x9f map to their Unicode equivalents and other codes map to other Unicode character values. Intended to provide most characters found in the Arabic alphabet.
Aliases: :ISO_8859-6, :ARABIC, :CSISOLATINARABIC,
:ISO-IR-127
:ISO-8859-7
An 8-bit, fixed-width character encoding in which codes #x00-#x9f map to their Unicode equivalents and other codes map to other Unicode character values. Intended to provide most characters found in the Greek alphabet.
Aliases: :ISO_8859-7, :GREEK, :GREEK8, :CSISOLATINGREEK,
:ISO-IR-126, :ELOT_928, :ECMA-118
:ISO-8859-8
An 8-bit, fixed-width character encoding in which codes #x00-#x9f map to their Unicode equivalents and other codes map to other Unicode character values. Intended to provide most characters found in the Hebrew alphabet.
Aliases: :ISO_8859-8, :HEBREW, :CSISOLATINHEBREW,
:ISO-IR-138
:ISO-8859-9
An 8-bit, fixed-width character encoding in which codes #x00-#xcf map to their Unicode equivalents and other codes map to other Unicode character values. Intended to provide most characters found in the Turkish alphabet.
Aliases: :ISO_8859-9, :LATIN5, :CSISOLATIN5,
:ISO-IR-148
:ISO-8859-10
An 8-bit, fixed-width character encoding in which codes #x00-#x9f map to their Unicode equivalents and other codes map to other Unicode character values. Intended to provide most characters found in Nordic alphabets.
Aliases: :ISO_8859-10, :LATIN6, :CSISOLATIN6,
:ISO-IR-157
:ISO-8859-11
An 8-bit, fixed-width character encoding in which codes #x00-#x9f map to their Unicode equivalents and other codes map to other Unicode character values. Intended to provide most characters found in the Thai alphabet.
:ISO-8859-13
An 8-bit, fixed-width character encoding in which codes #x00-#x9f map to their Unicode equivalents and other codes map to other Unicode character values. Intended to provide most characters found in Baltic alphabets.
:ISO-8859-14
An 8-bit, fixed-width character encoding in which codes #x00-#x9f map to their Unicode equivalents and other codes map to other Unicode character values. Intended to provide most characters found in Celtic languages.
Aliases: :ISO_8859-14, :ISO-IR-199, :LATIN8, :L8,
:ISO-CELTIC
:ISO-8859-15
An 8-bit, fixed-width character encoding in which codes #x00-#x9f map to their Unicode equivalents and other codes map to other Unicode character values. Intended to provide most characters found in Western European languages (including the Euro sign and some other characters missing from ISO-8859-1).
Aliases: :ISO_8859-15, :LATIN9
:ISO-8859-16
An 8-bit, fixed-width character encoding in which codes #x00-#x9f map to their Unicode equivalents and other codes map to other Unicode character values. Intended to provide most characters found in Southeast European languages.
Aliases: :ISO_8859-16, :ISO-IR-226, :LATIN10, :L10
:MACINTOSH
An 8-bit, fixed-width character encoding in which codes #x00-#x7f map to their Unicode equivalents and other codes map to other Unicode character values. Traditionally used on Classic MacOS to encode characters used in western languages.
Aliases: :MACOS-ROMAN, :MACOSROMAN, :MAC-ROMAN,
:MACROMAN
:UCS-2
A 16-bit, fixed-length encoding in which characters with CHAR-CODEs less than #x10000 can be encoded in a single 16-bit word. The endianness of the encoded data is indicated by the endianness of a byte-order-mark character (#\u+feff) prepended to the data; in the absence of such a character on input, the data is assumed to be in big-endian order.
:UCS-2BE
A 16-bit, fixed-length encoding in which characters with CHAR-CODEs less than #x10000 can be encoded in a single 16-bit big-endian word. The encoded data is implicitly big-endian; byte-order-mark characters are not interpreted on input or prepended to output.
:UCS-2LE
A 16-bit, fixed-length encoding in which characters with CHAR-CODEs less than #x10000 can be encoded in a single 16-bit little-endian word. The encoded data is implicitly little-endian; byte-order-mark characters are not interpreted on input or prepended to output.
:US-ASCII
A 7-bit, fixed-width character encoding in which all character codes map to their Unicode equivalents.
Aliases: :CSASCII, :CP637, :IBM637, :US,
:ISO646-US, :ASCII, :ISO-IR-6
:UTF-16
A 16-bit, variable-length encoding in which characters with CHAR-CODEs less than #x10000 can be encoded in a single 16-bit word and characters with larger codes can be encoded in a pair of 16-bit words. The endianness of the encoded data is indicated by the endianness of a byte-order-mark character (#\u+feff) prepended to the data; in the absence of such a character on input, the data is assumed to be in big-endian order. Output is written in native byte order with a leading byte-order mark.
:UTF-16BE
A 16-bit, variable-length encoding in which characters with CHAR-CODEs less than #x10000 can be encoded in a single 16-bit big-endian word and characters with larger codes can be encoded in a pair of 16-bit big-endian words. The endianness of the encoded data is implicit in the encoding; byte-order-mark characters are not interpreted on input or prepended to output.
:UTF-16LE
A 16-bit, variable-length encoding in which characters with CHAR-CODEs less than #x10000 can be encoded in a single 16-bit little-endian word and characters with larger codes can be encoded in a pair of 16-bit little-endian words. The endianness of the encoded data is implicit in the encoding; byte-order-mark characters are not interpreted on input or prepended to output.
:UTF-32
A 32-bit, fixed-length encoding in which all Unicode characters can be encoded in a single 32-bit word. The endianness of the encoded data is indicated by the endianness of a byte-order-mark character (#\u+feff) prepended to the data; in the absence of such a character on input, input data is assumed to be in big-endian order. Output is written in native byte order with a leading byte-order mark.
Alias: :UCS-4
:UTF-32BE
A 32-bit, fixed-length encoding in which all Unicode characters can be encoded in a single 32-bit word. The encoded data is implicitly big-endian; byte-order-mark characters are not interpreted on input or prepended to output.
Alias: :UCS-4BE
:UTF-8
An 8-bit, variable-length character encoding in which characters with CHAR-CODEs in the range #x00-#x7f can be encoded in a single octet; characters with larger code values can be encoded in 2 to 4 bytes.
:UTF-32LE
A 32-bit, fixed-length encoding in which all Unicode characters can be encoded in a single 32-bit word. The encoded data is implicitly little-endian; byte-order-mark characters are not interpreted on input or prepended to output.
Alias: :UCS-4LE
:Windows-31j
An 8-bit, variable-length character encoding in which character code points in the range #x00-#x7f can be encoded in a single octet; characters with larger code values can be encoded in 2 bytes.
Aliases: :CP932, :CSWINDOWS31J
:EUC-JP
An 8-bit, variable-length character encoding in which character code points in the range #x00-#x7f can be encoded in a single octet; characters with larger code values can be encoded in 2 bytes.
Alias: :EUCJP
:GB2312
An 8-bit, variable-length character encoding in which character code points in the range #x00-#x80 can be encoded in a single octet; characters with larger code values can be encoded in 2 bytes.
Aliases: :GB2312-80, :GB2312-1980, :EUC-CN, :EUCCN
:CP936
An 8-bit, variable-length character encoding in which character code points in the range #x00-#x80 can be encoded in a single octet; characters with larger code values can be encoded in 2 bytes.
Aliases: :GBK, :MS936, :WINDOWS-936
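The fixed-width and variable-length descriptions in the list above can be compared by counting the octets that a single character occupies in a few encodings. This sketch relies on ENCODE-STRING-TO-OCTETS (described next); the function name ENCODED-LENGTH is hypothetical, and the explicitly big-endian variants are used so that no byte order mark is added.

```lisp
;; Hypothetical helper: number of octets needed to encode the single
;; character with code CODE in ENCODING.
(defun encoded-length (code encoding)
  (length (ccl::encode-string-to-octets (string (code-char code))
                                        :external-format encoding)))

(list (encoded-length #x41 :utf-8)        ; 1 octet
      (encoded-length #xe9 :utf-8)        ; 2 octets
      (encoded-length #x20ac :utf-8)      ; 3 octets
      (encoded-length #x1f600 :utf-8)     ; 4 octets
      (encoded-length #x41 :utf-16be)     ; 2 octets: one 16-bit word
      (encoded-length #x1f600 :utf-16be)  ; 4 octets: a pair of words
      (encoded-length #x41 :utf-32be))    ; 4 octets: one 32-bit word
```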
Clozure CL provides functions to encode and decode strings to and from vectors of type (simple-array (unsigned-byte 8)).
decode-string-from-octets vector &key start end external-format string
Decodes the octets in vector (or the subsequence of it delimited by start and end) into a string according to external-format.
If string is supplied, output will be written into it. It must be large enough to hold the decoded characters. If string is not supplied, a new string will be allocated to hold the decoded characters.
Returns, as multiple values, the decoded string and the position in vector where the decoding ended.
Sequences of octets in vector that cannot be decoded into characters according to external-format will be decoded as #\Replacement_Character.
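A minimal decoding sketch follows. The helper OCTETS is hypothetical and exists only to build a vector of the required element type; the five octets are the UTF-8 encoding of the four-character string "caf" followed by the character with code #xe9.

```lisp
;; Hypothetical helper: build a (simple-array (unsigned-byte 8)) vector.
(defun octets (&rest values)
  (make-array (length values)
              :element-type '(unsigned-byte 8)
              :initial-contents values))

;; Decode five UTF-8 octets into a four-character string.
(ccl::decode-string-from-octets (octets 99 97 102 195 169)
                                :external-format :utf-8)
;; The first value is the decoded string; per the description above,
;; the second is the position in the vector where decoding ended.
```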
encode-string-to-octets string &key start end external-format use-byte-order-mark vector vector-offset
Encodes string (or the substring delimited by start and end) into external-format and returns, as multiple values, a vector of octets containing the encoded data and an integer that specifies the offset into the vector where the encoded data ends.
When use-byte-order-mark is true, a byte-order mark will be included in the encoded data.
If vector is supplied, output will be written to it. It must be of type (simple-array (unsigned-byte 8)) and be large enough to hold the encoded data. If it is not supplied, the function will allocate a new vector.
If vector-offset is supplied, data will be written into the output vector starting at that offset.
Characters in string that cannot be encoded into external-format will be replaced with an encoding-dependent replacement character (#\Replacement_Character or #\Sub) before being encoded and written into the output vector.
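The two functions are inverses for any string the chosen encoding can represent. The sketch below illustrates this with a round trip; the function name OCTET-ROUND-TRIP is hypothetical, and only the keyword arguments documented above are used.

```lisp
;; Hypothetical helper: encode STRING to octets and decode it again
;; with the same EXTERNAL-FORMAT. Only the first value returned by
;; ENCODE-STRING-TO-OCTETS (the octet vector) is passed along.
(defun octet-round-trip (string external-format)
  (ccl::decode-string-from-octets
   (ccl::encode-string-to-octets string :external-format external-format)
   :external-format external-format))

;; Usage: both return values of the encoder, collected into a list.
;; (multiple-value-list
;;  (ccl::encode-string-to-octets (format nil "caf~c" (code-char #xe9))
;;                                :external-format :utf-8))
```

Writing into a caller-supplied vector with vector and vector-offset avoids the allocation, at the cost of having to size the vector in advance.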