Entry-Level Unicode for XML

A “just enough education to perform” guide to Unicode and ISO 10646 for authors of XML parsers and other software that processes XML.

Contents

  1. Introduction
    1. Note on the Term “Application”
  2. Why Use Unicode?
  3. Scope of this Article
    1. Don’t Panic!
  4. ISO 10646 Vs Unicode
  5. From ABC to 01000001
    1. Characters
    2. Code Points
      1. U+ Notation
    3. Code Units
      1. UTF-32
      2. UTF-16
      3. UTF-8
        1. UTF-8, Security and “Non-Shortest Forms”
      4. Which Should I Use?
    4. Encoding Schemes
      1. The UTF-8 Encoding Scheme
      2. Encoding Schemes Based on the UTF-16 Encoding Form
        1. The UTF-16BE Encoding Scheme
        2. The UTF-16LE Encoding Scheme
        3. The UTF-16 Encoding Scheme
          1. The BOM
        4. Flipping Code Units
      3. Encoding Schemes Based on the UTF-32 Encoding Form
      4. Legacy Encoding Schemes
        1. Character References
      5. Which Should I Use?
  6. Character Properties and XML’s Char, NameChar and NameStartChar
    1. Implementing Verification of Character Types
  7. Overlap between Unicode and Markup Semantics
  8. Normalisation
    1. Unicode Normalisation Forms
      1. NFD
      2. NFC
      3. NFKD and NFKC
    2. NFC and XML
    3. W3C Character Model and Normalisation
  9. Restricting Allowed Characters
    1. Noncharacters
    2. Special Characters
      1. The BOM
      2. Control Characters and Other Special Characters
      3. Private Use Characters
    3. Machine-Targeted Data and US-ASCII
  10. Summary
  11. Appendices
    1. Appendix A — Example Character Property Lookup Code
      1. Variations
    2. Appendix B — Internationalised URIs
  12. References

Introduction

Okay, as part of gathering the loose bunch of documents and items here best collectively referred to as “stuff” into something approximating a website, I’ve been thinking about articles I promised to write but never got around to.

Top of the list is an article on Unicode for people who are doing various things with XML or XML applications. There are a few questions that came up again and again to which I was hopefully able to give an answer, and those answers are repeated in this article on the logic that they are probably useful to some other people out there. This is also the only article that anyone actually asked me to write as such, so it seems a productive task for me to get stuck into.

On the other hand I have a certain degree of trepidation with this: I know a few people who know a lot about Unicode, some of whom also know a lot about XML. Now, I always try to get knowledgeable people to review anything I write, but there is something that feels a bit hubristic about this. Anyway, at least I can badger them into reviewing this for me (the responsibility for any errors remains with me).

Another cause of hesitation is that the Unicode Standard [Unicode] (from whose first five chapters almost everything here is taken) is one of the most readable books ever to have the word “Standard” in its title, and an online version is freely available. Really you should go and read that instead, but people often don’t when I say that. So here’s something shorter that’ll help you out a bit and hopefully also make you realise that this stuff is interesting in itself, so you’ll be convinced the book’s worth taking a look at.

Note on the Term “Application”

Just to clarify this from the beginning, when I say “XML application” in this document I’m referring to a type of XML document combined with a specification as to how it should be used, as such XHTML, RDF/XML and SVG are XML applications. I am not referring to the software that processes or produces such documents, although they are also referred to as “applications” in different contexts.

Why Use Unicode?

It’s common to begin a piece about Unicode or any other internationalisation matter with a small piece propagandising its value, telling you that successful applications will be used around the world and to design for internationalisation concerns is to design for success and so on. More compellingly there are plenty of ways in which supposedly simpler uses of legacy character sets get pretty complicated fast.

There are lots of good reasons, but I’m not going to get into them here. XML uses Unicode; if you don’t use Unicode you aren’t using XML. If you try to communicate with an application expecting or sending XML and you don’t use Unicode you really shouldn’t expect it to work. There are cases where you can take shortcuts, but you need to know about Unicode to know when these are safe.

Scope of this Article

This article aims only to give you enough information to process Unicode text that is received from one process (or file) and sent to another. This covers a large range of XML applications, in particular many that are designed to run on a webserver.

This article does not touch upon many issues that are of extreme importance if you are rendering Unicode text for display, or accepting user input in Unicode. If you are designing an XML application which deals with such matters you will need to study Unicode, and a few matters that go beyond Unicode, in much greater detail (alternatively you could examine using elements from the namespace of an existing application such as XHTML [XHTML] or SVG [SVG] for rendering or XForms [XForms] for input, which potentially will allow for a lot of code reuse as well).

Further, this article does not touch upon issues that may be specific to your code’s task. For instance you may need to do case-insensitive comparisons, but XML processing per se does not, and so this is not dealt with.

Don’t Panic!

Okay, this is a pretty big document. Considering its size and the fact that I’m only dealing with the essentials, the whole topic could seem quite daunting. However, much of the time the matters here will be dealt with by an XML API (such as a SAX or DOM implementation) or through Unicode-aware string handling functions. Most developers need an awareness of these issues but won’t have to get their hands dirty with them. It’s still worth knowing what’s going on, and particularly important to know what should be going on if you are debugging and need to know whether a problem lies with your code or the library it is using.

ISO 10646 Vs Unicode

Something that can cause confusion is the difference between ISO 10646 [ISO 10646] and Unicode. Suffice to say:

  1. Both are referenced by the XML Specification.
  2. If you implement Unicode you have implemented ISO 10646 (though the inverse does not necessarily hold).

As such I’ll use “Unicode” in a liberal sense, where I might instead use “ISO 10646”, “UCS”, “the Universal Character Set” or “the Universal Character Set as defined by Unicode and ISO 10646”. It’ll make for less typing ☺

See Appendix C of the Unicode Standard [Unicode] if you want to know more.

From ABC to 01000001

Characters

Just what is a character? Is ‘a’ the shape you can see if you are reading this on screen or page, or is it the sound you can hear if you are using a screen reader, or is it the pattern of bumps you can feel if you are using a Braille reader? Is the ‘a’ in “nap” the same as the ‘a’ in “nape”? (for that matter is the ‘t’ in the English word “chat” the same as the ‘t’ in the French word «chat»?) Is ‘a’ the dot-dash of Morse Code, or the 0x61 of a byte in a computer’s memory?

Two different readers of this document with different browser settings may see very different shapes, one the “open” form of the letter with a line curving above a circle, the other the “closed” form.

None of these descriptions are adequate descriptions of the letter ‘a’. Yet young children manage to handle the concept of ‘a’ well enough to set them on the path to learning to read and write. If a two-year-old can manage it then it can’t be that hard!

Clearly ‘a’ is an abstract concept. Talking about it as an abstract concept is difficult, as are most abstract concepts, but we can recognise the relationship between this concept and the shapes, sounds and other communication methods.

Defining these abstract concepts narrowly enough to have a good guideline for what is or is not a character is tricky. Arguably ‘A’ is the same character as ‘a’, and markup such as <span style="text-transform: uppercase">a</span> makes this really tricky: it renders looking like ‘A’, but it’s still ‘a’ (or is it?).

It’s tempting to consider “character” as an atomic concept but ‘ç’ can be broken into the characters ‘c’ and ‘¸’. Maybe it would have been better if we only considered ‘c’ and ‘¸’ to be characters, but both convenience and issues arising from existing practice mean that Unicode allows ‘ç’ as well.

The relationships between characters from different languages and scripts are tricky. Most people would say that the ‘a’ (“Ay”) in English is the same as the ‘a’ (“Ah”) in French, but that the ‘T’ in English is different to the ‘Т’ in Russian — the latter may have a similar shape when written (hence the band tAtU having a website at www.taty.ru; while ASCII, it looks like “тАтУ”, the Cyrillic form of their name [yes it would make me look cooler if I could find an example of one script’s letters masquerading as those of another that didn’t involve teeny-boppers]), and a similar phonological value, but it is from a different alphabet and so used in a different context. However there are also characters that are shared between scripts, and of course a grey area where it may not be entirely clear whether a character is being shared between two scripts or whether there are two separate characters (is the Greek question mark the same as the Latin semicolon? in this case Unicode provides a separate character for compatibility with other standards but considers the two to be equivalent).

This is tricky stuff. Luckily the lovely Unicode people have worried about it already (or gone with what was already an existing standard and grumbled about it — either way it’s not our problem). While there are one or two cases that seem a bit messy (because the multitude of current and historical writing systems is messy in itself) they’ve done a good job. Hence we can generally use the following guideline as to what is or is not a character:

A Character is anything that has been assigned a Unicode code point.

This is of course completely the wrong way around and hopelessly circular, but we’re talking about using Unicode, not developing it. Suffice to say that this definition of character includes every character in this document (both considered as a piece of human-targeted text and human-readable source), all the characters given special meaning by XML or other standards commonly used with it (such as URIs) and a whole lot of others (around a quarter of a million).

Code Points

To do anything on a computer you of course have to get yourself into the realms of numbers. Unicode assigns numbers from the range 0 to 10FFFF₁₆ to characters. Care was taken in choosing the numbers assigned to each character, notably the first 128 positions coïncide with US-ASCII. However there is no perfectly correct order in which characters can be assigned and while there is thought and logic behind the assignments they are of little value in themselves.

Some code techniques rely on features of a particular coded character set, for example the following C/C++ function would convert a US-ASCII string to being entirely in uppercase:

void make_upper(char* str)
{
    char ch;
    while ((ch = *str)) {            /* stop at the NUL terminator */
        if (ch >= 'a' && ch <= 'z')  /* only the ASCII lowercase letters */
            *str = ch + 'A' - 'a';   /* shift into the uppercase range */
        ++str;
    }
}

This works on the following assumptions:

  1. The characters ‘A’ to ‘Z’ and the characters ‘a’ to ‘z’ lie in contiguous ranges.
  2. The characters ‘A’ to ‘Z’ are the only uppercase characters in the character set.
  3. A string can be converted to uppercase by replacing all instances of ‘a’ with ‘A’, all instances of ‘b’ with ‘B’ and so on until ‘z’ is replaced with ‘Z’.

None of these assumptions hold in general; the first is not true with EBCDIC, the second is not true with ISO-8859-1, the third doesn’t hold with any character set for which 2 doesn’t hold, and also for some languages (the uppercase of ‘ß’ is “SS”; in Turkish and Azerbaijani the upper case of ‘i’ is ‘İ’ and of ‘ı’ is ‘I’).

In Unicode the first does hold, but that’s not much use (especially since the distance between ‘A’ and ‘a’ isn’t the same as that between ‘Ā’ and ‘ā’).

The effect of all this is that you can’t really treat the numerical value of a code point as significant in any way.

However, even in the case of a simple character set like ASCII (which isn’t really that simple if you allow certain uses of control characters, but the past is another country) the above code isn’t the most efficient way of implementing that function, especially on a modern pipelined machine, so this isn’t as much of a loss as you might think.

U+ Notation

Now is a good time to introduce a notation that is used with Unicode. The character ‘a’ has the code point 97, or 61₁₆ in hexadecimal. To state that we are talking about a Unicode code point, as opposed to any other use of that number, we write “U+” followed by the number in hexadecimal, using leading zeros to ensure there are at least 4 digits but not using any leading zeros otherwise; hence U+0061 is the code point of ‘a’, but U+61 or U+000061 would not be used.
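
In code this maps straightforwardly onto hexadecimal formatting with a minimum field width; a trivial C++ sketch:

#include <cstdio>

// Print a code point in U+ notation: at least four hex digits,
// no further leading zeros (e.g. U+0061, U+0422, U+10FFFF).
void print_codepoint(unsigned long cp)
{
    std::printf("U+%04lX\n", cp);
}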

Characters are often referred to using the same notation, which is especially useful in the case of a combining character (on which more later) or a character that is generally represented by a glyph that looks much like another (Т and T often look similar, if not identical, the former is U+0422 and the latter is U+0054).

Of course you are quite likely none the wiser for this, and there is also the risk of ambiguity between when one is talking about a character and when one is talking about its code point, so to prevent ambiguity we can add the official Unicode name of the character. Hence the two examples above become U+0422 CYRILLIC CAPITAL LETTER TE and U+0054 LATIN CAPITAL LETTER T, which makes the characters of the two similar glyphs explicit.

Code Units

So far we have gone from one abstraction, a character, to another, an abstract integral number in the range 0 to 10FFFF₁₆. We of course need to:

  1. Get this into a real datatype so our code can do something with it.
  2. Get this into a binary encoding that can be shared by two interoperating processes.

The second of these of course naturally follows from the first; we’ll get to that in the next section.

You are of course free to do whatever you want within your application. However a common approach to placing characters into datatypes will make for more understandable code, as well as forming a natural basis for a sharable binary encoding.

As such Unicode defines encoding forms for doing this. It defines 3 ways, which have different costs and benefits for different uses and on different machine architectures. We’ll look at each of them in turn:

UTF-32

The most obvious method is to pick an unsigned integral datatype large enough to contain every single code point, and use that. On most modern computers the smallest natural datatype size that has at least the 21 bits we need to contain everything up to 10FFFF₁₆ is 32 bits. Currently this is also the natural integer size that many processors deal with most efficiently, although 64-bit processors are becoming more common.

So, we simply stick each character into a 32-bit unsigned datatype and call this UTF-32.

UTF-16

UTF-32 has advantages of simplicity, but it is a bit wasteful; there are 11 bits that are never used, and 5 that are rarely used, for the most commonly used code points all lie within the first 10000₁₆, from U+0000 to U+FFFF (collectively called the Basic Multilingual Plane or BMP). This waste isn’t a big deal when we are only dealing with a single character — especially given that code on 32-bit processors will generally cast smaller datatypes up to 32 bits before working with them. However in a document that contains the entire text of a novel this waste will account for megabytes of space.

There is also the fact that earlier versions of Unicode — which intended only to encode “commercially significant” characters — made use of the range U+0000 to U+FFFF only, and these were naturally catered for by a datatype that is 16 bits in size, and hence there is legacy code out there expecting Unicode characters to fit into 16 bits. A 16-bit encoding is therefore needed that can handle the entire range of 110000₁₆ code points.

The solution is as follows:

  1. The code points U+D800 to U+DFFF will never be assigned to characters.
  2. Any other code point in the range U+0000 to U+FFFF is represented in UTF-16 by its code point value.
  3. Any code point in the range U+10000 to U+10FFFF can be represented in UTF-16 as two code units (together called a “surrogate pair”) as follows:
    1. Subtract 10000₁₆ from the code point value, to give a 20-bit number in the range 0 to FFFFF₁₆.
    2. The first code unit is the bitwise inclusive OR of the 10 highest-order bits of the 20-bit number with D800₁₆ and the second code unit is the bitwise inclusive OR of the 10 lowest-order bits of the 20-bit number with DC00₁₆.

Some people prefer to think of the last step as starting with the values D80016 and DC0016 and assigning the 10 highest and lowest bits to them, I prefer to think of this as an OR operation because of the most obvious implementation in my language of choice.
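
In C++ that encoding step might look like the following sketch (which assumes the code point has already been validated as at most 10FFFF₁₆ and not itself a surrogate code point):

#include <cstdint>

// Encode one code point as UTF-16; returns the number of code units
// written (1 or 2). Assumes cp <= 0x10FFFF and is not a surrogate.
int encode_utf16(std::uint32_t cp, std::uint16_t out[2])
{
    if (cp < 0x10000) {
        out[0] = static_cast<std::uint16_t>(cp); // BMP: code point == code unit
        return 1;
    }
    cp -= 0x10000;                                              // now a 20-bit number
    out[0] = static_cast<std::uint16_t>(0xD800 | (cp >> 10));   // high surrogate
    out[1] = static_cast<std::uint16_t>(0xDC00 | (cp & 0x3FF)); // low surrogate
    return 2;
}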

This clearly solves our problem of how to get 21-bit numbers into a 16-bit datatype. It’s worth noting two things here, one good, one bad:

  1. There is no value that can be both the first “high surrogate” and the second “low surrogate”, hence unlike other multi-unit encodings there is no possible confusion about what a code unit means; it is either clearly representing a code point from the BMP, or else is clearly a high surrogate that must be followed by a low surrogate, or is clearly a low surrogate that must follow a high surrogate.
  2. A process ordering strings by UTF-16 code units won’t order them in the same way as UTF-32 or UTF-8 (which we’ll get to in a moment). Ordering by code units isn’t going to give a good ordering anyway (this requires a locale-dependent collation algorithm), but there are a lot of programming tasks for which it is important to have an ordering and any ordering will do (e.g. finding an exact match for a string in an ordered collection of strings through a binary search). Ordering by code units is the most efficient way of doing this, but in the rare cases where the ordering has to work the same with both UTF-16 and another encoding this will not work.

UTF-8

Just as there is a lot of code that expects 16-bit code units, there is a huge amount of code that expects 8-bit code units.

Further there is a lot of code that expects certain strings of US-ASCII to be significant in some application-defined way, though it tolerates 8-bit code units outside the range 0 to 7F₁₆ so that it can work with any of the large number of 8-bit encodings which share the first 128 values with US-ASCII (for example any of the ISO 8859 encodings).

The solution is really quite a beautiful piece of design that manages to handle these requirements, along with a few others I won’t go into here, without doing any violence to the structure of the encoded code point. It works as follows:

  1. If the code point is less than 80₁₆ then just place it directly into the octet stream, otherwise output 2 to 4 octets as follows:
  2. Output the first octet as follows:
    1. If the code point is less than 800₁₆ then output the bitwise inclusive OR of bits 7 through 11 with C0₁₆.
    2. Otherwise, if the code point is less than 10000₁₆ output the bitwise inclusive OR of bits 13 through 16 with E0₁₆.
    3. Otherwise output the bitwise inclusive OR of bits 19 through 21 with F0₁₆.
  3. There will now be 6, 12 or 18 bits left to encode; these are encoded by outputting 1, 2 or 3 octets, each holding 6 bits ORed with 80₁₆.

This is easily visualised by looking at the following table:

Code Point Range      Code Point in Binary (21 bits shown)   Octets in Binary
U+0000 to U+007F      00000000000000XXXXXXX₂                 0XXXXXXX₂
U+0080 to U+07FF      0000000000XXXXXYYYYYY₂                 110XXXXX₂ 10YYYYYY₂
U+0800 to U+FFFF      00000XXXXYYYYYYZZZZZZ₂                 1110XXXX₂ 10YYYYYY₂ 10ZZZZZZ₂
U+10000 to U+10FFFF   WWWXXXXXXYYYYYYZZZZZZ₂                 11110WWW₂ 10XXXXXX₂ 10YYYYYY₂ 10ZZZZZZ₂

There are specifications for encoding any code point up to 7FFFFFFF₁₆ using UTF-8, but these are obviously not used with Unicode, where no code point above 10FFFF₁₆ exists.

Some noteworthy features of UTF-8:

  1. Ordering UTF-8 strings according to code units has the same results as ordering UTF-32 strings according to code units.
  2. From any position within the string it is easy to see if you are at the start of a Unicode character. If not you can get to the start by testing no more than 3 octets (or 2 tests and one assumption if you can rely on the string being valid).
  3. From the start of any character the position of the next character is clear.
  4. There are no overlapping values; any octet is either clearly a single character, clearly the start of a character, or clearly part of a character whose start is 1 to 3 octets earlier in the sequence.
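
Putting the algorithm and the table together, a minimal C++ encoding sketch (again assuming a pre-validated code point) might look like this:

#include <cstdint>

// Encode one code point as 1 to 4 UTF-8 octets; returns the number
// written. Assumes cp <= 0x10FFFF and is not a surrogate code point.
int encode_utf8(std::uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {                       // one octet: the code point itself
        out[0] = static_cast<unsigned char>(cp);
        return 1;
    }
    int len = cp < 0x800 ? 2 : (cp < 0x10000 ? 3 : 4);
    static const unsigned char lead[5] = { 0, 0, 0xC0, 0xE0, 0xF0 };
    for (int i = len - 1; i > 0; --i) {
        out[i] = static_cast<unsigned char>(0x80 | (cp & 0x3F)); // continuation octet
        cp >>= 6;
    }
    out[0] = static_cast<unsigned char>(lead[len] | cp);         // lead octet
    return len;
}
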
UTF-8, Security and “Non-Shortest Forms”

One thing to be aware of with UTF-8 is that naïve processors may assume that the UTF-8 is valid and hence just strip the relevant bits from the correct number of bytes. Such a processor would interpret the octet sequence C0₁₆ AF₁₆ as an encoding of U+002F SOLIDUS. This character is significant in quite a few cases, including URIs and XML. As such potentially dangerous constructs could contain it (e.g. “/../”) which must be filtered. If filtering is performed on the UTF-8 then such a filter might not prohibit the same construct using the longer sequence (U+002F is of course not the only character where this poses a security problem, but it is notable for being a culprit in some real security holes; see [rain forest puppy]).

These sequences are not valid UTF-8, and an XML processor should consider them to be a fatal error. The Unicode Standard describes a way to catch non-shortest forms by matching patterns of the source UTF-8 sequences, which has the advantage of catching other illegal sequences (sequences that encode something above 10FFFF₁₆ or a surrogate code point). Personally I prefer checking that the code point is within the correct range for the number of UTF-8 code units after decoding; the code just seems more explicit to me that way.
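
A sketch of that post-decode check; the table of minimum values embodies the “correct range” for each sequence length:

#include <cstdint>

// After decoding cp from len UTF-8 code units, reject non-shortest
// forms, surrogate code points and values beyond the Unicode range.
bool valid_decoded(std::uint32_t cp, int len)
{
    static const std::uint32_t min_for_len[5] = { 0, 0x0, 0x80, 0x800, 0x10000 };
    if (len < 1 || len > 4) return false;
    if (cp < min_for_len[len]) return false;        // non-shortest form
    if (cp >= 0xD800 && cp <= 0xDFFF) return false; // surrogate code point
    return cp <= 0x10FFFF;                          // beyond Unicode's range
}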

As an optimisation the more naïve processing can be used if you can be 100% certain that the source is valid UTF-8, since it will be slightly more efficient. Essentially, this is only possible if the source is another piece of the same code — even in-process DLLs should be considered suspect. Really, I don’t think this optimisation is worth it.

A test document containing such an insecure sequence should be rejected by any XML parser; a parser that accepted it could open its users up to security issues.

Which Should I Use?

When presented with 3 choices the obvious question is “which is the best?” The shrewder reader will ask, “which is the best in which case?” but the question still remains.

The answer depends on the type of processor the code will run on, the amount of text that will have to be stored on disk and in memory, and the type of processing done.

In general I’d recommend UTF-32 if you are dealing with one or a small number of characters and UTF-16 otherwise. However that’s only a very rough guideline; notably there are quite a lot of tasks that can be done efficiently on UTF-8 data without converting it out of UTF-8, especially if most of the characters relating to the task are within the range 0 to 7F₁₆ (all the characters given a particular meaning in XML1.0, and all but two given a particular meaning in XML1.1, are within that range, and those two characters are both considered equivalent to U+000A in XML1.1).

Encoding Schemes

To transfer Unicode data between processes, especially if those processes are on different machines, we have more work to do. We need to drop yet another level of abstraction from datatypes down to octets.

Of course we could define a serialisation of Unicode characters to octets without considering the encoding forms, but they have already done much of the work for us and form natural bases for encoding schemes. Hence there are encoding schemes based on the encoding forms, and they work as follows:

The UTF-8 Encoding Scheme

Because UTF-8 is already defined in terms of octets the encoding scheme based on it is trivially defined, each code unit is an octet and we just use that octet in the octet stream. UTF-8 may, but does not need to, use a BOM (see below).

Encoding Schemes Based on the UTF-16 Encoding Form

With UTF-16 there is not the same obvious decision about how we should serialise the data as octets. The 16-bit number 1234₁₆ can be represented as the octet 12₁₆ followed by the octet 34₁₆ (big-endian) or as the octet 34₁₆ followed by the octet 12₁₆ (little-endian). Internally some computers use one way, and some the other.

There are three solutions used, with the result of three encoding schemes:

The UTF-16BE Encoding Scheme

The UTF-16BE encoding scheme outputs each UTF-16 code unit as a pair of octets in big-endian order, hence “UTF” which is U+0055, U+0054, U+0046 is output as the following octets: 00, 55, 00, 54, 00, 46.

The UTF-16LE Encoding Scheme

The UTF-16LE encoding scheme outputs each UTF-16 code unit as a pair of octets in little-endian order, hence “UTF” which is U+0055, U+0054, U+0046 is output as the following octets: 55, 00, 54, 00, 46, 00.

The UTF-16 Encoding Scheme

The UTF-16 encoding scheme operates either as UTF-16BE or as UTF-16LE. Since the same stream would have different meanings depending on which of these is used, the UTF-16 encoding scheme should either have its byte order indicated through some out-of-band mechanism, or begin with a BOM.

The BOM

The BOM (Byte Order Mark) is the character U+FEFF. The code point U+FFFE is a noncharacter; that is, it is not assigned to a character and never will be. Now if you receive a stream in the UTF-16 encoding scheme that begins with the octets FE₁₆, FF₁₆ then it must be in big-endian order, as the little-endian interpretation of those two octets is U+FFFE, which is a noncharacter. Similarly, if the stream begins with FF₁₆, FE₁₆ then it must be in little-endian order.

Annoyingly, U+FEFF is also used as ZERO WIDTH NO-BREAK SPACE. While that character is deprecated for that use, it does still occur — hence if a BOM turns up somewhere unexpected it is not necessarily an error.

Unicode allows for the assumption of big-endian order (that is UTF-16 without a BOM should be considered to be in big-endian order), however XML does not permit this.

When used with the UTF-8 encoding scheme the BOM is serialised as the sequence EF₁₆, BB₁₆, BF₁₆. This isn’t needed for ordering, since UTF-8 is always in the same order, but it does act as a signature of UTF-8 — in general 8-bit legacy encodings are very unlikely to begin with those 3 octets.
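
A minimal BOM-sniffing sketch in C++ (the enumeration and function names are illustrative only; full autodetection, including streams without a BOM, is described in Appendix F of the XML specification):

#include <cstddef>

enum class Scheme { Unknown, Utf8, Utf16be, Utf16le };

// Inspect the first octets of a stream for a BOM. Note that if UTF-32
// were also supported, its little-endian BOM (FF FE 00 00) would need
// to be checked for before UTF-16LE.
Scheme sniff_bom(const unsigned char* p, std::size_t n)
{
    if (n >= 3 && p[0] == 0xEF && p[1] == 0xBB && p[2] == 0xBF)
        return Scheme::Utf8;
    if (n >= 2 && p[0] == 0xFE && p[1] == 0xFF)
        return Scheme::Utf16be;
    if (n >= 2 && p[0] == 0xFF && p[1] == 0xFE)
        return Scheme::Utf16le;
    return Scheme::Unknown; // no BOM; byte order must come from elsewhere
}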

Flipping Code Units

If you have to deal with 16-bit code units that are in the opposite byte order to that used by your code you have to flip them. This is done efficiently by the following C/C++/Java code (the masking ensures a correct result when, as is usual, the code unit is held in an integer type wider than 16 bits):

flipped = ((input >> 8) & 0xFF) | ((input << 8) & 0xFF00);

Encoding Schemes Based on the UTF-32 Encoding Form

You can probably guess these! UTF-32BE outputs 4 octets for each UTF-32 code unit, using big-endian format. UTF-32LE outputs 4 octets for each UTF-32 code unit, using little-endian format. The UTF-32 encoding scheme uses a BOM to indicate byte order. UTF-32 is not often used to transfer XML.

Legacy Encoding Schemes

Another way to transmit Unicode data is to convert it into another character set, transmit it, and then convert it back. This is useful in a few situations — for example if you are creating XML with a text editor that doesn’t allow you to save in a Unicode encoding scheme, or if you want to insert some legacy data into your XML with little processing (although care must still be taken to avoid characters significant to the syntax of XML).

Character References

Character references can be used to represent a character for one of three reasons:

  1. It is a character — such as ‘<’, ‘>’, ‘'’, ‘&’, ‘"’ — which has a significance within the syntax of XML itself, and it is necessary to use a character reference to escape the character from this use.
  2. It is an awkward character for an author to type.
  3. It is not a character that can be represented in the legacy encoding scheme used to encode the document.

All character references make use of Unicode code points to represent a character, either in decimal or in hexadecimal (so ‘a’ can be represented either by &#x61; or by &#97;). It’s important to note that the values are not related to the character set used to encode the document. In HTML it is not uncommon to see the trademark symbol (U+2122) encoded as &#153;. This is not the correct reference, but rather a reference to a control character (however some legacy character sets do have that character at that position, which is no doubt the source of the error). Some browsers, on some operating systems, with some locale settings, will correct this mistake and render the correct character — but not all. Really the reference should be either &#8482; or &#x2122;.
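
As a sketch, turning the numeric part of a reference into a code point is a thin wrapper around the standard library (char_ref_value is a hypothetical helper; a real parser would also check the result against the Char production):

#include <cstdint>
#include <cstdlib>

// Decode the numeric part of a character reference, e.g. "8482" or
// "x2122"; returns the code point, or 0 on failure.
std::uint32_t char_ref_value(const char* s)
{
    const bool hex = (*s == 'x' || *s == 'X');
    const char* digits = hex ? s + 1 : s;
    char* end = nullptr;
    unsigned long v = std::strtoul(digits, &end, hex ? 16 : 10);
    if (end == digits || v > 0x10FFFF) return 0; // no digits, or out of range
    return static_cast<std::uint32_t>(v);
}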

Which Should I Use?

Again, the different options available mean there is the question of which should be used in which cases. If you are writing an application that accepts XML you have no choice but to write it so that it will work with both the UTF-8 and UTF-16 encoding schemes; the XML specs insist on that.

Accepting UTF-16BE, UTF-16LE, the encoding schemes based on UTF-32 and any legacy schemes is optional. In general once you’ve written code to process UTF-16 it is easy to accept UTF-16BE and UTF-16LE (just bypass the interpretation of the BOM) and once you’ve written UTF-8 it is easy to accept US-ASCII (just consider any octet with a value higher than 7F₁₆ to be an error, and then treat the characters lower than 80₁₆ the same as you would in UTF-8). I’d recommend you do accept those encodings, particularly US-ASCII, as they are quite often used.

All the rest are extra credit. You can leave it at just UTF-8 and UTF-16 and you have a conformant parser, or you could go nuts and write code to handle every encoding scheme listed by IANA [IANA Char Reg].

When outputting, you can use any encoding registered with IANA, however since conformant parsers only have to accept UTF-8 and UTF-16 it is best to go with one of those.

In most cases the size of the octet stream is the biggest concern with something that will, or at least might, be transmitted over an Internet connection. The number of octets UTF-8 uses to encode a code point depends on the range of values that code point comes from (this is also true of UTF-16, but for all characters where UTF-16 would use 4 octets UTF-8 also uses 4), so as a rule if the text is mainly from an East Asian language that uses Han ideographs or Hangul syllables (Japanese, Chinese or Korean) UTF-16 will result in a smaller octet stream; otherwise UTF-8 will.

With a perfect compression system either would compress to the same size. In practice of course we don’t have perfect compression schemes; however, if you are compressing, the difference between the sizes of the outputs you get with UTF-8 and UTF-16 becomes smaller as the amount of text encoded becomes larger.

It’s worth noting that XML canonicalisation [C14N] requires that UTF-8 without a BOM be used — it mandates a particular encoding scheme so that binary operations like cryptographic signatures will always be working on the same octets for the same piece of XML. However this requirement is only at the stage where octets are sent to the octet-based process (such as SHA-1 hashing) that requires them, in general no process should require a particular encoding — they should all accept UTF-8 (with or without BOM) and UTF-16 at a minimum.

If you can, give your users a choice or make this something that can be configured by an administrator.

Character Properties and XML’s Char, NameChar and NameStartChar

As well as naming a character, giving an example glyph, and possibly adding an explanatory note Unicode ascribes various properties to each assigned character. For example U+0061 LATIN SMALL LETTER A is a lowercase letter, it is not a combining character, it is normally used in text written left-to-right, it can be converted to uppercase by replacing it with U+0041 LATIN CAPITAL LETTER A, and it has no numerical value.

These and other properties are available for each character. However using these is out of the scope of this article.

A related matter that is pertinent though, is the fact that, largely based on these properties, XML defines certain characters that can be used in certain contexts.

XML1.0 and XML1.1 allow different characters in different contexts, but for the most part I will only describe the XML1.0 usage; XML1.1 usage is analogous.

The first definition that is relevant here is that of a Char:

[2] Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */

This defines which characters can be used in an XML1.0 document. It is clearly very liberal, banning only some of the control characters and the noncharacters U+FFFE and U+FFFF. Indeed it is somewhat too liberal in my view since it allows other noncharacters (the code points from U+FDD0 to U+FDEF inclusive and the last 2 code points in each plane, from U+1FFFE & U+1FFFF through to U+10FFFE & U+10FFFF, are noncharacters) but the production quoted above allows them.

This is potentially problematic, since it is not clear whether one should accept the character U+FDD0 because it is allowed in the above production or reject it, because it is not a valid Unicode character. XML1.1 contains a note to avoid these characters, and also some of the control characters, but still doesn’t prohibit them per se.

Really though, all this means is that XML gives you a valid way of communicating nonsense — since there is no character U+FDD0 it cannot be in a piece of meaningful text, or part of a meaningful name or token. As such it’s safe to consider it an error — if you see it there was nothing wrong with the XML itself, but what it communicated was meaningless, and your application can just refuse to deal with meaningless communications. This isn’t to say that you should necessarily always treat noncharacters in this manner, but I can’t foresee problems if you do.

The next relevant definition is given different names in XML1.0 and XML1.1: XML1.0 uses the name “Letter”, XML1.1 the somewhat more precise name “NameStartChar”. These are the characters that a name (such as an element or attribute name) is allowed to start with. XML1.0 defines it thus:

[84] Letter ::= BaseChar | Ideographic
[85] BaseChar ::= [#x0041-#x005A] | [#x0061-#x007A] | [#x00C0-#x00D6] | [#x00D8-#x00F6] | [#x00F8-#x00FF] | [#x0100-#x0131] | [#x0134-#x013E] | [#x0141-#x0148] | [#x014A-#x017E] | [#x0180-#x01C3] | [#x01CD-#x01F0] | [#x01F4-#x01F5] | [#x01FA-#x0217] | [#x0250-#x02A8] | [#x02BB-#x02C1] | #x0386 | [#x0388-#x038A] | #x038C | [#x038E-#x03A1] | [#x03A3-#x03CE] | [#x03D0-#x03D6] | #x03DA | #x03DC | #x03DE | #x03E0 | [#x03E2-#x03F3] | [#x0401-#x040C] | [#x040E-#x044F] | [#x0451-#x045C] | [#x045E-#x0481] | [#x0490-#x04C4] | [#x04C7-#x04C8] | [#x04CB-#x04CC] | [#x04D0-#x04EB] | [#x04EE-#x04F5] | [#x04F8-#x04F9] | [#x0531-#x0556] | #x0559 | [#x0561-#x0586] | [#x05D0-#x05EA] | [#x05F0-#x05F2] | [#x0621-#x063A] | [#x0641-#x064A] | [#x0671-#x06B7] | [#x06BA-#x06BE] | [#x06C0-#x06CE] | [#x06D0-#x06D3] | #x06D5 | [#x06E5-#x06E6] | [#x0905-#x0939] | #x093D | [#x0958-#x0961] | [#x0985-#x098C] | [#x098F-#x0990] | [#x0993-#x09A8] | [#x09AA-#x09B0] | #x09B2 | [#x09B6-#x09B9] | [#x09DC-#x09DD] | [#x09DF-#x09E1] | [#x09F0-#x09F1] | [#x0A05-#x0A0A] | [#x0A0F-#x0A10] | [#x0A13-#x0A28] | [#x0A2A-#x0A30] | [#x0A32-#x0A33] | [#x0A35-#x0A36] | [#x0A38-#x0A39] | [#x0A59-#x0A5C] | #x0A5E | [#x0A72-#x0A74] | [#x0A85-#x0A8B] | #x0A8D | [#x0A8F-#x0A91] | [#x0A93-#x0AA8] | [#x0AAA-#x0AB0] | [#x0AB2-#x0AB3] | [#x0AB5-#x0AB9] | #x0ABD | #x0AE0 | [#x0B05-#x0B0C] | [#x0B0F-#x0B10] | [#x0B13-#x0B28] | [#x0B2A-#x0B30] | [#x0B32-#x0B33] | [#x0B36-#x0B39] | #x0B3D | [#x0B5C-#x0B5D] | [#x0B5F-#x0B61] | [#x0B85-#x0B8A] | [#x0B8E-#x0B90] | [#x0B92-#x0B95] | [#x0B99-#x0B9A] | #x0B9C | [#x0B9E-#x0B9F] | [#x0BA3-#x0BA4] | [#x0BA8-#x0BAA] | [#x0BAE-#x0BB5] | [#x0BB7-#x0BB9] | [#x0C05-#x0C0C] | [#x0C0E-#x0C10] | [#x0C12-#x0C28] | [#x0C2A-#x0C33] | [#x0C35-#x0C39] | [#x0C60-#x0C61] | [#x0C85-#x0C8C] | [#x0C8E-#x0C90] | [#x0C92-#x0CA8] | [#x0CAA-#x0CB3] | [#x0CB5-#x0CB9] | #x0CDE | [#x0CE0-#x0CE1] | [#x0D05-#x0D0C] | [#x0D0E-#x0D10] | [#x0D12-#x0D28] | [#x0D2A-#x0D39] | [#x0D60-#x0D61] | [#x0E01-#x0E2E] | #x0E30 | [#x0E32-#x0E33] | [#x0E40-#x0E45] | [#x0E81-#x0E82] | #x0E84 | [#x0E87-#x0E88] | #x0E8A | #x0E8D | [#x0E94-#x0E97] | [#x0E99-#x0E9F] | [#x0EA1-#x0EA3] | #x0EA5 | #x0EA7 | [#x0EAA-#x0EAB] | [#x0EAD-#x0EAE] | #x0EB0 | [#x0EB2-#x0EB3] | #x0EBD | [#x0EC0-#x0EC4] | [#x0F40-#x0F47] | [#x0F49-#x0F69] | [#x10A0-#x10C5] | [#x10D0-#x10F6] | #x1100 | [#x1102-#x1103] | [#x1105-#x1107] | #x1109 | [#x110B-#x110C] | [#x110E-#x1112] | #x113C | #x113E | #x1140 | #x114C | #x114E | #x1150 | [#x1154-#x1155] | #x1159 | [#x115F-#x1161] | #x1163 | #x1165 | #x1167 | #x1169 | [#x116D-#x116E] | [#x1172-#x1173] | #x1175 | #x119E | #x11A8 | #x11AB | [#x11AE-#x11AF] | [#x11B7-#x11B8] | #x11BA | [#x11BC-#x11C2] | #x11EB | #x11F0 | #x11F9 | [#x1E00-#x1E9B] | [#x1EA0-#x1EF9] | [#x1F00-#x1F15] | [#x1F18-#x1F1D] | [#x1F20-#x1F45] | [#x1F48-#x1F4D] | [#x1F50-#x1F57] | #x1F59 | #x1F5B | #x1F5D | [#x1F5F-#x1F7D] | [#x1F80-#x1FB4] | [#x1FB6-#x1FBC] | #x1FBE | [#x1FC2-#x1FC4] | [#x1FC6-#x1FCC] | [#x1FD0-#x1FD3] | [#x1FD6-#x1FDB] | [#x1FE0-#x1FEC] | [#x1FF2-#x1FF4] | [#x1FF6-#x1FFC] | #x2126 | [#x212A-#x212B] | #x212E | [#x2180-#x2182] | [#x3041-#x3094] | [#x30A1-#x30FA] | [#x3105-#x312C] | [#xAC00-#xD7A3]
[86] Ideographic ::= [#x4E00-#x9FA5] | #x3007 | [#x3021-#x3029]

Finally there is the definition “NameChar” which XML1.0 defines thus:

[4] NameChar ::= Letter | Digit | '.' | '-' | '_' | ':' | CombiningChar | Extender
[87] CombiningChar ::= [#x0300-#x0345] | [#x0360-#x0361] | [#x0483-#x0486] | [#x0591-#x05A1] | [#x05A3-#x05B9] | [#x05BB-#x05BD] | #x05BF | [#x05C1-#x05C2] | #x05C4 | [#x064B-#x0652] | #x0670 | [#x06D6-#x06DC] | [#x06DD-#x06DF] | [#x06E0-#x06E4] | [#x06E7-#x06E8] | [#x06EA-#x06ED] | [#x0901-#x0903] | #x093C | [#x093E-#x094C] | #x094D | [#x0951-#x0954] | [#x0962-#x0963] | [#x0981-#x0983] | #x09BC | #x09BE | #x09BF | [#x09C0-#x09C4] | [#x09C7-#x09C8] | [#x09CB-#x09CD] | #x09D7 | [#x09E2-#x09E3] | #x0A02 | #x0A3C | #x0A3E | #x0A3F | [#x0A40-#x0A42] | [#x0A47-#x0A48] | [#x0A4B-#x0A4D] | [#x0A70-#x0A71] | [#x0A81-#x0A83] | #x0ABC | [#x0ABE-#x0AC5] | [#x0AC7-#x0AC9] | [#x0ACB-#x0ACD] | [#x0B01-#x0B03] | #x0B3C | [#x0B3E-#x0B43] | [#x0B47-#x0B48] | [#x0B4B-#x0B4D] | [#x0B56-#x0B57] | [#x0B82-#x0B83] | [#x0BBE-#x0BC2] | [#x0BC6-#x0BC8] | [#x0BCA-#x0BCD] | #x0BD7 | [#x0C01-#x0C03] | [#x0C3E-#x0C44] | [#x0C46-#x0C48] | [#x0C4A-#x0C4D] | [#x0C55-#x0C56] | [#x0C82-#x0C83] | [#x0CBE-#x0CC4] | [#x0CC6-#x0CC8] | [#x0CCA-#x0CCD] | [#x0CD5-#x0CD6] | [#x0D02-#x0D03] | [#x0D3E-#x0D43] | [#x0D46-#x0D48] | [#x0D4A-#x0D4D] | #x0D57 | #x0E31 | [#x0E34-#x0E3A] | [#x0E47-#x0E4E] | #x0EB1 | [#x0EB4-#x0EB9] | [#x0EBB-#x0EBC] | [#x0EC8-#x0ECD] | [#x0F18-#x0F19] | #x0F35 | #x0F37 | #x0F39 | #x0F3E | #x0F3F | [#x0F71-#x0F84] | [#x0F86-#x0F8B] | [#x0F90-#x0F95] | #x0F97 | [#x0F99-#x0FAD] | [#x0FB1-#x0FB7] | #x0FB9 | [#x20D0-#x20DC] | #x20E1 | [#x302A-#x302F] | #x3099 | #x309A
[88] Digit ::= [#x0030-#x0039] | [#x0660-#x0669] | [#x06F0-#x06F9] | [#x0966-#x096F] | [#x09E6-#x09EF] | [#x0A66-#x0A6F] | [#x0AE6-#x0AEF] | [#x0B66-#x0B6F] | [#x0BE7-#x0BEF] | [#x0C66-#x0C6F] | [#x0CE6-#x0CEF] | [#x0D66-#x0D6F] | [#x0E50-#x0E59] | [#x0ED0-#x0ED9] | [#x0F20-#x0F29]
[89] Extender ::= #x00B7 | #x02D0 | #x02D1 | #x0387 | #x0640 | #x0E46 | #x0EC6 | #x3005 | [#x3031-#x3035] | [#x309D-#x309E] | [#x30FC-#x30FE]

Implementing Verification of Character Types

Because some of these definitions involve a large number of non-contiguous ranges, which cannot be identified algorithmically, implementing code that verifies whether a given character is or is not of an allowed type for a particular context can be problematic; a switch statement, an if-else chain, nested ranges, or a large Boolean expression is neither efficient nor particularly readable. Since these lookups are often performed frequently, efficiency is of significant importance.

The obvious solution with a small (say 8-bit) character set would be to use a lookup table; an array of 256 values could be indexed with the code point to retrieve the relevant value. With Unicode this would require at least 110000₁₆ values (or 200000₁₆ if you wanted to guard against using an invalid pointer only by ensuring the upper 11 bits of a 32-bit datatype were 0 — forcing a value to be 1FFFFF₁₆ or less can be considerably more efficient than ensuring it is 10FFFF₁₆ or less on many machines).

Even if each value were 1 bit in size (say, for a Boolean value) this would require 136 kilobytes of data for the lookup. This may be acceptable — much of the table would likely remain paged out to disk for the lifetime of the table, but it also may not.

One solution that is suggested in the Unicode Standard for implementing lookup of character properties can be applied here as well.

  1. We create two tables. The first is indexed by the higher-order bits of the code point and gives us an offset into the second table; we then add the lower-order bits to this offset and use the result as an index into the second table to retrieve the relevant value for the character.
  2. If any “block” of the second table (where a block is a group of values that can be reached after retrieving a given offset from the first table) is a duplicate of another we can alter the offset that points into that block to point to the block it is a duplicate of and then remove the duplicate block.

Because duplicate blocks are removed there is less storage needed for the data (this can be compared to removing duplicate states in a finite state automaton, which essentially it is). Because the assignment of code points tries to group related characters together, these duplicate blocks are actually quite frequent. Lookup can operate without branching penalties, with the result that it will be very fast. More than 2 stages can be used, with an advantage in memory size but a disadvantage in lookup speed.

See Appendix A for an example piece of code implementing this.

Overlap Between Unicode and Markup Semantics

Some characters in Unicode, particularly control characters, have semantics that are, or can be, also provided by XML markup. For example Unicode can tag a piece of text as being in a particular language, as can the xml:lang attribute; Unicode can give explicit instructions as to whether a piece of text should be rendered left-to-right or right-to-left, as can the dir attribute in XHTML [XHTML]. Further, a lot of Unicode characters are essentially variants of other characters which have different semantics and are generally rendered differently (bolder, superscript, subscript) to indicate those different semantics. An application-specific XML application may offer stronger and/or more detailed semantics, and any rendering will reflect those semantics in an appropriate way (hence in Unicode plain text we can say E=MC², but in HTML we can say E=MC<sup>2</sup> and it will render with the 2 as a superscript; a mathematical or physics markup language such as MathML [MathML] allows for even more precise semantics in this case).

Unicode in XML and other Markup Languages [UnicodeXML] (jointly published by the W3C and the Unicode Consortium) generally recommends that the features of the markup be used rather than features of Unicode. See that document for more details.

If you are designing an XML application where features will reflect the appearance of text to a reader, or if you will allow XML elements from an application, such as XHTML, that does this, then you need to consider these interactions. However this goes into aspects of Unicode that are beyond the scope of this article.

Normalisation

We noted above that Unicode contains characters such as ‘ç’ that can be broken down into characters (‘c’ and ‘¸’ in this case).

Unicode considers either form to be equivalent, and processes are not meant to treat them differently.

There are also single characters that are considered equivalent to other single characters by Unicode, for example U+212B ANGSTROM SIGN is considered equivalent to U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE. In this case the reason for encoding U+212B separately is to allow round-trip conversion between this and other character sets that considered the two to be separate.

Further, Korean Hangul text can be encoded equivalently either as alphabetic components called Jamo or as precomposed syllables.

Further complications ensue if there are two combining characters combining with the same base character. In the case of a letter having both a cedilla and an acute accent it doesn’t really matter which is considered to be added first — the cedilla and the acute accent will not interfere with each other. However, in the case of a letter having both an umlaut and an acute accent the order is important; it affects whether the umlaut goes above the accent or the accent above the umlaut. There are still more complications where a particular combination of two different diacriticals requires special treatment, but this is a matter of rendering and beyond the scope of what we are going to tackle here.

Unicode deals with the question of whether the order of two combining characters is significant or not by assigning each character a combining class. If the combining class is zero the character is not a combining character, otherwise if two characters have the same combining class their order is significant and re-ordering them would alter the meaning of the text.

Unicode further defines a canonical ordering of combining characters, where the lower the combining class the earlier it goes in the sequence. Putting characters into this canonical ordering can simplify many tasks.
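
A sketch of that reordering; ccc here stands in for a lookup of a character’s combining class, which in a real implementation would come from the Unicode Character Database:

#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical lookup of a character's combining class (0 for starters).
int ccc(std::uint32_t cp);

// Canonical ordering: stable-sort each run of combining marks by
// combining class. A stable sort is essential so that marks with the
// same class (whose relative order is significant) are not reordered.
void canonical_order(std::vector<std::uint32_t>& text)
{
    auto it = text.begin();
    while (it != text.end()) {
        if (ccc(*it) == 0) { ++it; continue; }
        auto run_end = it;
        while (run_end != text.end() && ccc(*run_end) != 0) ++run_end;
        std::stable_sort(it, run_end,
            [](std::uint32_t a, std::uint32_t b) { return ccc(a) < ccc(b); });
        it = run_end;
    }
}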

Finally there are four different normalisation forms that Unicode text can be put into.

Unicode Normalisation Forms

NFD

In Normalisation Form D any character that can be canonically decomposed — that is, replaced with a sequence that is considered to be canonically equivalent — is. Decomposition is applied recursively, so U+212B ANGSTROM SIGN is replaced (via U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE) with U+0041 LATIN CAPITAL LETTER A followed by U+030A COMBINING RING ABOVE, U+00C7 LATIN CAPITAL LETTER C WITH CEDILLA is replaced with U+0043 LATIN CAPITAL LETTER C followed by U+0327 COMBINING CEDILLA, and all Korean syllables are replaced with the equivalent Jamo sequences.

Then any combining characters in the text are placed into the canonical order.

NFC

In Normalisation Form C text is first placed into NFD and then re-combined so that Korean Jamo sequences are replaced with syllables, and other characters are combined as follows:

  1. The text will consist of a series of base characters, each followed by zero or more combining characters.
  2. For each combining character there is an opportunity to combine it with the base character if there is no character in between them with the same combining class as the combining character in question (this is to avoid a re-ordering that would change the meaning).
  3. If there is a character which is canonically equivalent to the base character followed by the combining character that is not excluded from NFC (see below) then the base character is replaced by that character, the combining character is removed, and the examination of combining characters continues from that point.

Clearly NFC will be more compact than the equivalent NFD. It is not the most compact form of normalisation theoretically possible, but it is a form that can be produced very efficiently by a streaming processor that doesn’t use much memory, and one that can be checked with an isNFC() function extremely efficiently.

As mentioned above, some characters are excluded from NFC; this is to ensure that additions to Unicode won’t mean that meaningful text in NFC won’t suddenly cease to be in NFC because a new combining character was added.

NFKD and NFKC

Normalisation Form KD and Normalisation Form KC are similar to NFD and NFC respectively. However, as well as decomposing characters that Unicode considers to be equivalent, characters are also decomposed that are considered “compatibility composites”. This can change the meaning of the text, for instance U+00B2 SUPERSCRIPT TWO is decomposed to U+0032 DIGIT TWO. NFKD just decomposes and reorders; NFKC then recomposes canonical sequences only. These forms should not be “blindly” applied to text; however they are useful in some contexts such as searching and “find text” operations.

NFC and XML

As mentioned above Unicode requires that canonically equivalent sequences be considered equivalent. XML does not consider such sequences equivalent, and will not match <façade> with </façade> if one is formed using the precomposed character and the other using the combining cedilla character.

One possible solution to this is for an XML application to insist on text being in NFC, in this way canonically equivalent sequences will always have the same sequence of code points. However XML does not insist on this in itself, though the use of NFC is recommended.

If you are writing code to work with an XML application designed by someone else that does not insist on NFC then it is not advisable to convert the text to NFC. While this would be an appealing solution to the problem, it raises security issues (similar to those caused by malformed UTF-8).

W3C Character Model and Normalisation

The W3C character model [Charmod] (currently a working draft) defines degrees of normalisation that go beyond NFC in requiring that concatenating text nodes and/or inserting text when processing entity references or XML inclusions does not stop the text being in NFC. Essentially this means that you should not begin the text in a text node, in an included piece of text, or immediately following the point where text would be included, with a combining character.

None of this is mandated, but you can (with the due caution that comes with using something from a working draft) use these definitions when creating your own XML applications.

In particular it’s interesting to note that if you have a text node beginning with U+0338 COMBINING LONG SOLIDUS OVERLAY then NFC would replace it and the preceding > character with U+226F NOT GREATER-THAN, resulting in a document that is no longer well-formed XML.

Restricting Allowed Characters

It is allowable for a Unicode application to only use some of the characters and/or features available. However this is not a recipe for throwing everything away and just using ASCII.

There are however plenty of people who want to just use ASCII, especially if they want to write code that works directly on the “raw” text of an XML document. Really, life is simpler if you use an API that handles XML properly and insulates you from the raw octets but people insist on doing this.

There are times when this is reasonable enough, and times when it isn’t. Every list, bulletin board and newsgroup about XML or an XML application will occasionally have questions from people who are performing pattern-matching against the input octets and finding it doesn’t work. Sometimes the people on those lists will be helpful to these people, sometimes they’ll eat the head off them for doing something you aren’t supposed to do and being surprised when it doesn’t work (sometimes the question is framed as an accusation or complaint, in which case they always get flamed to a crisp and frankly deserve it). Look before you leap into applying such pattern-matching techniques and avoid the flames.

For the type of code we are talking about here, where many of the issues about fonts, directionality, layout of combining marks, ligatures etc. don’t concern us there is little reason why one should not accept and process the entire range of Unicode characters. The complexities of correctly rendering the line breaks between text that contains a mixture of Thai, Hebrew, Ogham and Kanji don’t have any bearing on what we’re dealing with so there is nothing to stop you from accepting these characters (and if you are writing code where those issues arise you should have a better fallback behaviour than producing gibberish).

Anyway, there are valid reasons for restricting characters allowed, so in decreasing order of reasonableness here are characters you might want to prohibit. It’s important to stress that this is only a matter of the design of an XML application, not of the code that processes it. If you are writing code to parse or produce documents of an already existing XML application, or if you are designing an XML application that needs to be compatible with another XML application then you don’t have much say in what characters are allowed.

Noncharacters

While the XML spec doesn’t prohibit all noncharacters it is hard to see what one is meant to do with them. Explicitly banning them is probably a good idea. Outputting them is probably a bad idea even when working with an XML application whose spec has nothing to say about them.
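
All 66 noncharacters can be caught with a compact test, as in this sketch:

#include <cstdint>

// True for the noncharacters: U+FDD0..U+FDEF plus the last two code
// points of each of the 17 planes (U+FFFE/U+FFFF, U+1FFFE/U+1FFFF, ...).
bool is_noncharacter(std::uint32_t cp)
{
    if (cp >= 0xFDD0 && cp <= 0xFDEF) return true;
    return cp <= 0x10FFFF && (cp & 0xFFFE) == 0xFFFE;
}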

Special Characters

The BOM

The BOM should be processed as described above. Use of U+FEFF as a zero-width no-break space is problematic because of the potential ambiguity; U+2060 WORD JOINER should be used instead, but prohibiting this use of U+FEFF may cause difficulties with legacy data, so it is probably best to merely advise against it.

Control Characters and Other Special Characters

Some control characters are significant in XML (tab and the various newline function characters), others are explicitly banned, others clash with what can be done at the mark-up level in XML and are best avoided. See [UnicodeXML] for details.

Private Use Characters

The characters U+E000 to U+F8FF inclusive are the primary Private Use Area, the characters U+F0000 to U+FFFFD inclusive are the Supplementary Private Use Area-A and the characters U+100000 to U+10FFFD inclusive are the Supplementary Private Use Area-B. Protocols or agreements between parties may define interpretations for these characters, though clearly this prevents general interchange. There are three approaches one might take to these characters when defining an XML application:

  1. Prohibit their use. This ensures that a document will have the same meaning to any party that receives it.
  2. Define interpretations for some of the characters. As such the interpretations defined will become part of the specification, along with the elements, attributes and so on defined in the specification. This may be useful in some cases but for the most part it is better to use mechanisms from XML to communicate things from the application domain. A private use character that indicates a character in a collection of historical documents that has not yet been encoded or properly identified would be a reasonable use in a specialist application for encoding such documents; a private-use character to indicate your corporate monogram reduces interoperability for no real benefit over an image inclusion element.
  3. Allow private use characters, and leave the interpretation to the parties involved. This is the best approach for a very general-purpose application. You neither use nor define the use of private use characters, but you do allow other people to come to an agreement as to how they will use your application together with private use characters. Obviously such characters may only sensibly be used where relatively “freeform” text is allowed, rather than in text nodes or attribute values expected to contain precisely specified information such as a date or the name of a town.

Machine-Targeted Data and US-ASCII

XML applications can be divided into two types; human-targeted or “document-orientated” applications where the text will, eventually, be presented to a human reader, and machine-targeted or “data-orientated” applications where the text, while human-readable, is intended to be interpreted by software. Of course human-targeted XML will contain at least some text that is for the software rather than the user (e.g. the href attribute of an <a> element in XHTML) and in the case of SVG while some text may be presented to the user the majority of the document will describe how graphics should be built up, so this division is not clear cut.

When an application contains text for human consumption the range of characters that may potentially be used covers the full range of Unicode, and indeed that is why XML is built on top of Unicode. US-ASCII simply doesn’t work for written communication in any language. However, US-ASCII does cover enough of the English language that it can be used exclusively in the names used in an XML application, if English is the language used to give them a degree of mnemonic user-friendliness.

A document of such an application could still exceed the range of US-ASCII if it contained text aimed at a human reader (it’s one thing to use <facade> instead of <façade>, another to use <p>He naively assumed the Tudor facade was authentic.</p> instead of <p>He naïvely assumed the Tudor façade was authentic.</p>, never mind how badly this would cope with most other languages).

However, in the case of a machine-targeted application the text might naturally fit within the US-ASCII range; looking at the W3C XML Schema specification [Schema] (although the same applies to documents not defined in terms of such schemata), the various ISO 8601-based date and time formats, the numerical formats, Boolean format, base64 and hex encodings of binary data, and language identifiers are all composed exclusively from US-ASCII characters. Further, URIs, QNames, and tokens that are defined by an application can, and arguably (if English is the source of the mnemonics) should, be composed solely of US-ASCII characters. (Also, while Internationalised URIs can contain characters outside of the US-ASCII range, URIs cannot.)

As such there are a large number of XML applications where US-ASCII will naturally cover all of the characters used. This can be taken advantage of when coding software for such applications. On the whole, though, I recommend that software, particularly if it will be parsing rather than outputting, be built on top of XML APIs and not try to coerce things into a US-ASCII model. Still, if people would only work with XML as US-ASCII in the cases where it’s feasible, that would be an improvement.

Summary

Bah! I’m really bad at writing summaries. Anyway, hopefully this article has been useful to some, and with luck even interesting enough to encourage people to look at the full standard (if nothing else you can while the hours away flicking through the massive code chart that forms chapter 16, discovering interesting characters you didn’t know of before).

Appendices

Appendix A — Example Character Property Lookup Code

The following C++ code demonstrates a two-stage lookup into packed bytes (essentially making it a three-stage lookup) to discover whether a given character is a valid XML character, a valid character for an XML Name, or a valid character for the start of an XML Name.

To begin with the code needs a datatype to hold the code point of the character in question. The best choice will depend upon your machine so we’ll begin with a typedef to enable this to be easily changed in one place:

typedef unsigned int uchar_t;

With my compiler and on my machine this is a 32-bit unsigned datatype; on other architectures a different type may be more suitable. Note that uchar_t is quite an obvious choice for such a type and could potentially clash with other code; we would of course use namespaces to protect against these clashes.
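If you want a safeguard against a poor choice here, the old array-typedef trick can be pressed into service. This is just a sketch (the name is purely illustrative); it will fail to compile if uchar_t cannot hold the full 21-bit range of code points:

// The array size evaluates to -1, and hence a compile error, if uchar_t is too narrow.
typedef char uchar_t_holds_21_bits[(static_cast<uchar_t>(0x10FFFF) == 0x10FFFF) ? 1 : -1];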

Now, to begin with, our two-stage lookup needs two arrays to look up into. The obvious approach is to combine these together as members of a class. Since we’re going to need three such classes (more if we apply the same technique to other properties one might want to determine) we can gain from using a template here.

template<size_t offsetBits, size_t resultTableSize> class twoStageLookup{
public:
    const unsigned char offsets[1 << offsetBits];
    const unsigned char packedBytes[resultTableSize];
};

Three things are noteworthy at this point:

  1. We define the size of the first array in terms of the number of bits used to index into it, rather than as the size itself. This is because we’ll reuse that number later.
  2. Doing this means that we will have values in this first table for the full range of [0, 200000₁₆) addressable with 21 bits, not just the range [0, 110000₁₆) used by Unicode. This is a trade-off in terms of how safe we make the code. We won’t do any error-checking on the values passed to this function — this is low-level code, and a calling function that has already performed this checking, or obtained values from an encoding in which it is mathematically impossible to exceed the range, shouldn’t suffer penalties — however we don’t want to access past the limits of the array should there be such a bogus value passed, so we want to change invalid values to valid (if incorrect) ones. It’s generally faster to ensure a value is below 200000₁₆ than below 110000₁₆, as we can do so without branching, and this is what we’ll do.
  3. The two member arrays are public. This is undesirable, but necessary to ensure that we can initialise them statically with an initialiser list. Making them const avoids the more disastrous possible pitfalls (as well as allowing the structure to be stored in read-only memory).

Now we’re going to add some constants for use within the lookup; I’ll explain these as we come to them:

template<size_t offsetBits, size_t resultTableSize> class twoStageLookup{
    enum privateConstants{
        BITMASK21 = 0x1FFFFF,                   //Lower 21bits set.
        OFFSET_SHIFT = 21 - offsetBits,         // Shift to leave index for first array.
        BLOCK_SIZE = (1 << OFFSET_SHIFT),       // Size of a block within the second array.
        PACKED_SHIFT = 3,                       // Shift to exclude index into an octet.
        PACKED_MASK = (1 << PACKED_SHIFT) - 1,  // = 0x7 -> masks index into an octet.
        RESULT_MASK = 0x1                       // Single bit.
    };
public:
    const unsigned char offsets[1 << offsetBits];
    const unsigned char packedBytes[resultTableSize];
};

Now, we need a function to actually perform the lookup, and this might as well be a member function. I’ll take this step-by-step.

First we ensure that we can’t end up trying to access beyond the limits of the arrays. This minimal approach to error-checking leaves us with a GIGO result for arguments that are out of range, but it does stop anything more disastrous — like an error for accessing an invalid memory address:

bool lookup(uchar_t codePoint) const throw()
{
    codePoint &= BITMASK21; //Ensure only lower 21bits are set

Next we take the higher bits of the code point and use this as an index into the first array; this gives us the index of a block of values in the second array. Note that OFFSET_SHIFT is determined by the size of the first array.

    size_t offsetPart = codePoint >> OFFSET_SHIFT; //Obtain index into first table.
    size_t offset = offsets[offsetPart]; //First table contains index into second table.

We then take this offset, multiply it by the number of values in each block of the second array, add the remaining bits of the code point, and drop the lowest 3 bits of the total. This gives us the index within the second array of a byte that packs the values for 8 consecutive code points:

    // Get packed byte from second table:
    unsigned char packedByte = packedBytes[(offset * BLOCK_SIZE + (codePoint & (BLOCK_SIZE - 1))) >> PACKED_SHIFT];

Finally we check the value of the bit identified by the lowest 3 bits. This will be 1 or 0 depending on whether the property in question is true or false for that code point.

    //Shifting the packed byte this much will leave the bit in question as the lowest bit.
    size_t slideIndex = (codePoint & PACKED_MASK);

    return ((packedByte >> slideIndex) & RESULT_MASK) != 0; //Return value of that bit.

Putting all this together gives us:

bool lookup(uchar_t codePoint) const throw()
{
    codePoint &= BITMASK21; //Ensure only lower 21bits are set
    size_t offsetPart = codePoint >> OFFSET_SHIFT; //Obtain index into first table.
    size_t offset = offsets[offsetPart]; //First table contains index into second table.

    // Get packed byte from second table:
    unsigned char packedByte = packedBytes[(offset * BLOCK_SIZE + (codePoint & (BLOCK_SIZE - 1))) >> PACKED_SHIFT];

    //Shifting the packed byte this much will leave the bit in question as the lowest bit.
    size_t slideIndex = (codePoint & PACKED_MASK);

    return ((packedByte >> slideIndex) & RESULT_MASK) != 0; //Return value of that bit.
}

All the above is inline; with most current template implementations we’re pretty much forced to do this. Given that this code would quite often be called for every character in a large document, we probably want that optimisation anyway.

Okay, we’ve a mechanism for storing the data and accessing it, but we need some actual data. Producing this really needs a simple program that calculates the values by running a simpler implementation of the production (I used a big messy bunch of if statements) against each possible number, and then writes out the source code.
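By way of illustration, here is a minimal sketch of such a generator (not the program actually used; isXMLCharSlow is a naive range-check rendering of XML 1.0’s Char production, the output formatting is simplified, and the block numbering it produces may differ from the listing below, though the resulting tables are equivalent):

#include <cstddef>
#include <cstdio>
#include <vector>

// Naive range-check implementation of the Char production, used only for generation.
static bool isXMLCharSlow(unsigned long c)
{
    return c == 0x9 || c == 0xA || c == 0xD
        || (c >= 0x20 && c <= 0xD7FF)
        || (c >= 0xE000 && c <= 0xFFFD)
        || (c >= 0x10000 && c <= 0x10FFFF);
}

int main()
{
    typedef std::vector<unsigned char> Block;   // 256 packed bytes = 2048 code points.
    std::vector<Block> blocks;                  // Distinct blocks found so far.
    std::vector<std::size_t> offsets;           // First-table entry for each 2048-point range.

    for (unsigned long base = 0; base < 0x200000; base += 2048)
    {
        Block block(256, 0);
        for (unsigned long i = 0; i < 2048; ++i)
            if (isXMLCharSlow(base + i))
                block[i >> 3] |= static_cast<unsigned char>(1 << (i & 0x7));

        // Share identical blocks between ranges; this is where the space saving comes from.
        std::size_t b = 0;
        while (b < blocks.size() && blocks[b] != block)
            ++b;
        if (b == blocks.size())
            blocks.push_back(block);
        offsets.push_back(b);
    }

    // Emit the two initialiser lists (a real generator would add the grouping and comments).
    std::printf("{");
    for (std::size_t i = 0; i < offsets.size(); ++i)
        std::printf("%s%lu", i ? ", " : "", static_cast<unsigned long>(offsets[i]));
    std::printf("},\n{");
    for (std::size_t b = 0; b < blocks.size(); ++b)
        for (std::size_t i = 0; i < blocks[b].size(); ++i)
            std::printf("%s0x%02X", (b || i) ? ", " : "", blocks[b][i]);
    std::printf("}\n");
    return 0;
}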

In a header file we’ll put the following:

extern const twoStageLookup<10, 1024> isXMLCharData;

inline bool isXMLChar(uchar_t codePoint) throw()
{
    return isXMLCharData.lookup(codePoint);
}

The template parameters of isXMLCharData depend on the output of our generator program, which produces the next piece of code; we put that into a .cpp file. The function isXMLChar() is just a simple inline wrapper for convenience of use.

Here’s the code that actually contains the data for the lookup:

#include "twoStageLookup.h"

const twoStageLookup<10, 1024> isXMLCharData =
{
    {
        2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, /* U+0000 to U+7FFF */
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 3, /* U+8000 to U+FFFF */
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, /* U+10000 to U+17FFF */
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, /* U+18000 to U+1FFFF */
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, /* U+20000 to U+27FFF */
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, /* U+28000 to U+2FFFF */
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, /* U+30000 to U+37FFF */
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, /* U+38000 to U+3FFFF */
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, /* U+40000 to U+47FFF */
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, /* U+48000 to U+4FFFF */
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, /* U+50000 to U+57FFF */
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, /* U+58000 to U+5FFFF */
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, /* U+60000 to U+67FFF */
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, /* U+68000 to U+6FFFF */
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, /* U+70000 to U+77FFF */
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, /* U+78000 to U+7FFFF */
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, /* U+80000 to U+87FFF */
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, /* U+88000 to U+8FFFF */
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, /* U+90000 to U+97FFF */
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, /* U+98000 to U+9FFFF */
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, /* U+A0000 to U+A7FFF */
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, /* U+A8000 to U+AFFFF */
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, /* U+B0000 to U+B7FFF */
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, /* U+B8000 to U+BFFFF */
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, /* U+C0000 to U+C7FFF */
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, /* U+C8000 to U+CFFFF */
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, /* U+D0000 to U+D7FFF */
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, /* U+D8000 to U+DFFFF */
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, /* U+E0000 to U+E7FFF */
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, /* U+E8000 to U+EFFFF */
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, /* U+F0000 to U+F7FFF */
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, /* U+F8000 to U+FFFFF */
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, /* U+100000 to U+107FFF */
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, /* U+108000 to U+10FFFF *//*Limit of valid Unicode*/

        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, /* U+110000 to U+117FFF */
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, /* U+118000 to U+11FFFF */
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, /* U+120000 to U+127FFF */
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, /* U+128000 to U+12FFFF */
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, /* U+130000 to U+137FFF */
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, /* U+138000 to U+13FFFF */
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, /* U+140000 to U+147FFF */
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, /* U+148000 to U+14FFFF */
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, /* U+150000 to U+157FFF */
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, /* U+158000 to U+15FFFF */
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, /* U+160000 to U+167FFF */
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, /* U+168000 to U+16FFFF */
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, /* U+170000 to U+177FFF */
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, /* U+178000 to U+17FFFF */
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, /* U+180000 to U+187FFF */
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, /* U+188000 to U+18FFFF */
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, /* U+190000 to U+197FFF */
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, /* U+198000 to U+19FFFF */
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, /* U+1A0000 to U+1A7FFF */
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, /* U+1A8000 to U+1AFFFF */
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, /* U+1B0000 to U+1B7FFF */
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, /* U+1B8000 to U+1BFFFF */
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, /* U+1C0000 to U+1C7FFF */
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, /* U+1C8000 to U+1CFFFF */
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, /* U+1D0000 to U+1D7FFF */
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, /* U+1D8000 to U+1DFFFF */
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, /* U+1E0000 to U+1E7FFF */
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, /* U+1E8000 to U+1EFFFF */
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, /* U+1F0000 to U+1F7FFF */
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0  /* U+1F8000 to U+1FFFFF */
    },
    {
        0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
        0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
        0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
        0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
        0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
        0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
        0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
        0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
        0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
        0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
        0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
        0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
        0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
        0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
        0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
        0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,

        0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF,
        0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF,
        0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF,
        0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF,
        0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF,
        0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF,
        0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF,
        0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF,
        0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF,
        0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF,
        0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF,
        0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF,
        0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF,
        0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF,
        0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF,
        0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF,

        0x00, 0x26, 0x00, 0x00, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, /* U+0000 to U+007F */
        0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, /* U+0080 to U+00FF */
        0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, /* U+0100 to U+017F */
        0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, /* U+0180 to U+01FF */
        0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, /* U+0200 to U+027F */
        0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, /* U+0280 to U+02FF */
        0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, /* U+0300 to U+037F */
        0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, /* U+0380 to U+03FF */
        0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, /* U+0400 to U+047F */
        0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, /* U+0480 to U+04FF */
        0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, /* U+0500 to U+057F */
        0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, /* U+0580 to U+05FF */
        0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, /* U+0600 to U+067F */
        0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, /* U+0680 to U+06FF */
        0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, /* U+0700 to U+077F */
        0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, /* U+0780 to U+07FF */

        0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, /* U+F800 to U+F87F */
        0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, /* U+F880 to U+F8FF */
        0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, /* U+F900 to U+F97F */
        0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, /* U+F980 to U+F9FF */
        0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, /* U+FA00 to U+FA7F */
        0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, /* U+FA80 to U+FAFF */
        0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, /* U+FB00 to U+FB7F */
        0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, /* U+FB80 to U+FBFF */
        0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, /* U+FC00 to U+FC7F */
        0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, /* U+FC80 to U+FCFF */
        0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, /* U+FD00 to U+FD7F */
        0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, /* U+FD80 to U+FDFF */
        0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, /* U+FE00 to U+FE7F */
        0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, /* U+FE80 to U+FEFF */
        0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, /* U+FF00 to U+FF7F */
        0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0x3F  /* U+FF80 to U+FFFF */
    }
};

Of course the values returned for code points above 10FFFF₁₆ don’t really matter, but since we have a block of zeros already we might as well use it to return false for such calls.
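To check the machinery against the data, here is a worked trace of two lookups; the intermediate values are derived by hand from the tables above (recall that with offsetBits = 10 we have OFFSET_SHIFT = 11 and BLOCK_SIZE = 2048):

// isXMLCharData.lookup(0x41), i.e. LATIN CAPITAL LETTER A:
//   offsetPart = 0x41 >> 11               = 0
//   offset     = offsets[0]               = 2
//   byte index = (2 * 2048 + 0x41) >> 3   = 520
//   packedByte = packedBytes[520]         = 0xFF
//   slideIndex = 0x41 & 0x7               = 1
//   result     = (0xFF >> 1) & 0x1        = 1 -> true, 'A' is a valid XML character.

// isXMLCharData.lookup(0xFFFE), the noncharacter U+FFFE:
//   offsetPart = 0xFFFE >> 11             = 31
//   offset     = offsets[31]              = 3
//   byte index = (3 * 2048 + 0x7FE) >> 3  = 1023
//   packedByte = packedBytes[1023]        = 0x3F
//   slideIndex = 0xFFFE & 0x7             = 6
//   result     = (0x3F >> 6) & 0x1        = 0 -> false, U+FFFE is not a valid XML character.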

The equivalent code for determining whether a character can be used in an XML Name, or to start one, looks much the same as the above. There’s nothing new to learn from looking at it, but you can download the code for isXMLNameData and isXMLNameStartData if you want to experiment yourself.

The final version of the header file, incorporating what we’ve already described along with the declarations and helper functions for the last two lookup tables, is as follows:

#include <cstddef> // For size_t.

typedef unsigned int uchar_t; // As discussed above; adjust for your architecture.

template<size_t offsetBits, size_t resultTableSize> class twoStageLookup{
    enum privateConstants{
        BITMASK21 = 0x1FFFFF,                   //Lower 21bits set.
        OFFSET_SHIFT = 21 - offsetBits,         // Shift to leave index for first array.
        BLOCK_SIZE = (1 << OFFSET_SHIFT),       // Size of a block within the second array.
        PACKED_SHIFT = 3,                       // Shift to exclude index into an octet.
        PACKED_MASK = (1 << PACKED_SHIFT) - 1,  // = 0x7 -> masks index into an octet.
        RESULT_MASK = 0x1                       // Single bit.
    };
public:
    const unsigned char offsets[1 << offsetBits];
    const unsigned char packedBytes[resultTableSize];

    bool lookup(uchar_t codePoint) const throw()
    {
        codePoint &= BITMASK21; //Ensure only lower 21bits are set
        size_t offsetPart = codePoint >> OFFSET_SHIFT; //Obtain index into first table.
        size_t offset = offsets[offsetPart]; //First table contains index into second table.

        // Get packed byte from second table:
        unsigned char packedByte = packedBytes[(offset * BLOCK_SIZE + (codePoint & (BLOCK_SIZE - 1))) >> PACKED_SHIFT];

        //Shifting the packed byte this much will leave the bit in question as the lowest bit.
        size_t slideIndex = (codePoint & PACKED_MASK);

        return ((packedByte >> slideIndex) & RESULT_MASK) != 0; //Return value of that bit.
    }
};

extern const twoStageLookup<10, 1024> isXMLCharData;
extern const twoStageLookup<11, 1664> isXMLNameStartData;
extern const twoStageLookup<11, 1664> isXMLNameData;

inline bool isXMLChar(uchar_t codePoint) throw()
{
    return isXMLCharData.lookup(codePoint);
}

inline bool isXMLNameStartChar(uchar_t codePoint) throw()
{
    return isXMLNameStartData.lookup(codePoint);
}

inline bool isXMLNameChar(uchar_t codePoint) throw()
{
    return isXMLNameData.lookup(codePoint);
}
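By way of a usage sketch (the Name-checking function here is invented for illustration, and assumes the caller has already decoded the document into code points):

#include "twoStageLookup.h" // Assumed name for the header above.

// Returns true if the code points in [begin, end) form a valid XML Name.
bool isValidName(const uchar_t* begin, const uchar_t* end)
{
    if (begin == end || !isXMLNameStartChar(*begin))
        return false;
    for (const uchar_t* p = begin + 1; p != end; ++p)
        if (!isXMLNameChar(*p))
            return false;
    return true;
}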

Variations

Some variations on the above include using more than two stages in the lookup, using packed bytes in the first table, and not providing for lookups above 10FFFF₁₆. Such variations are mainly trade-offs between speed and size.

Some people might prefer to use a binary resource rather than compiled-in arrays.

Because the first tables for looking up whether a character can be in an XML Name and whether it can start one are identical, they can be shared with a little alteration to the approach taken.

The template can be made more general to allow it to store character properties which aren’t Boolean, as is the case with many of the properties defined by the Unicode Standard.
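Here is a hedged sketch of that last variation, packing valueBits bits per code point rather than a single bit (the class name is illustrative, and the divisions are left unoptimised for clarity; valueBits should divide 8 evenly):

template<size_t offsetBits, size_t resultTableSize, size_t valueBits>
class twoStagePropertyLookup{
    enum privateConstants{
        BITMASK21 = 0x1FFFFF,                   // Lower 21 bits set.
        OFFSET_SHIFT = 21 - offsetBits,         // Shift to leave index for first array.
        BLOCK_SIZE = (1 << OFFSET_SHIFT),       // Code points per block of the second array.
        VALUES_PER_BYTE = 8 / valueBits,        // How many values each octet packs.
        RESULT_MASK = (1 << valueBits) - 1      // Masks a single value.
    };
public:
    const unsigned char offsets[1 << offsetBits];
    const unsigned char packedBytes[resultTableSize];

    unsigned lookup(uchar_t codePoint) const throw()
    {
        codePoint &= BITMASK21;
        size_t offset = offsets[codePoint >> OFFSET_SHIFT];
        size_t index = offset * BLOCK_SIZE + (codePoint & (BLOCK_SIZE - 1));
        unsigned char packedByte = packedBytes[index / VALUES_PER_BYTE];
        return (packedByte >> ((index % VALUES_PER_BYTE) * valueBits)) & RESULT_MASK;
    }
};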

Appendix B — Internationalised URIs

URIs were designed largely in terms of the US-ASCII character set, but with provision for the encoding of arbitrary octets, and hence for transmitting characters from other character sets.

To provide for Unicode characters, Internationalised URIs use UTF-8 to produce these octets.

Some XML applications use such Internationalised URIs; in the XML document they are not escaped, except for those characters that are significant within URIs themselves. To convert an Internationalised URI to a URI you:

  1. Convert each disallowed character to UTF-8 as one or more octets.
  2. Escape these octets using the URI escaping mechanism; that is, convert each to %HH, where HH is the octet’s value as two hexadecimal digits.
  3. Replace the disallowed character with the resulting character sequence.
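A minimal sketch of those three steps follows, using the uchar_t typedef from Appendix A (the function name is illustrative, and only characters outside the US-ASCII range are treated as disallowed; a real implementation would follow the full URI grammar and escape disallowed US-ASCII characters such as spaces as well):

#include <cstddef>
#include <cstdio>
#include <string>

// Convert a sequence of code points forming an Internationalised URI into a
// US-ASCII URI, escaping each non-ASCII character as percent-encoded UTF-8 octets.
std::string iriToUri(const uchar_t* begin, const uchar_t* end)
{
    std::string uri;
    for (const uchar_t* p = begin; p != end; ++p)
    {
        uchar_t c = *p;
        if (c < 0x80)
        {
            uri += static_cast<char>(c); // Passed through as-is (see caveat above).
            continue;
        }

        // Step 1: encode the character as UTF-8, giving two to four octets.
        unsigned char octets[4];
        size_t count;
        if (c < 0x800)
        {
            octets[0] = static_cast<unsigned char>(0xC0 | (c >> 6));
            octets[1] = static_cast<unsigned char>(0x80 | (c & 0x3F));
            count = 2;
        }
        else if (c < 0x10000)
        {
            octets[0] = static_cast<unsigned char>(0xE0 | (c >> 12));
            octets[1] = static_cast<unsigned char>(0x80 | ((c >> 6) & 0x3F));
            octets[2] = static_cast<unsigned char>(0x80 | (c & 0x3F));
            count = 3;
        }
        else
        {
            octets[0] = static_cast<unsigned char>(0xF0 | (c >> 18));
            octets[1] = static_cast<unsigned char>(0x80 | ((c >> 12) & 0x3F));
            octets[2] = static_cast<unsigned char>(0x80 | ((c >> 6) & 0x3F));
            octets[3] = static_cast<unsigned char>(0x80 | (c & 0x3F));
            count = 4;
        }

        // Steps 2 and 3: escape each octet as %HH and substitute the sequence.
        for (size_t i = 0; i < count; ++i)
        {
            char escaped[4];
            std::sprintf(escaped, "%%%02X", octets[i]);
            uri += escaped;
        }
    }
    return uri;
}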

This mechanism has the same security issues as UTF-8 in general; indeed, the example of an exploit of poorly handled UTF-8 [rain forest puppy] is actually an Internationalised URI exploit.

References

The following are not just references, but also recommended further reading.

XML1.0
Tim Bray, Jean Paoli, C. M. Sperberg-McQueen, Eve Maler, François Yergeau,
Extensible Markup Language (XML) 1.0 (Third Edition) <http://www.w3.org/TR/REC-xml>.
You should have already read this, but if this article was of use to you then it might be a good idea to re-read the XML specification as some parts may now be clearer.
XML1.1
John Cowan, François Yergeau, Tim Bray, Jean Paoli, C. M. Sperberg-McQueen, Eve Maler,
Extensible Markup Language (XML) 1.1 <http://www.w3.org/TR/xml11/>.
This is a Proposed Recommendation at the time of writing. Particularly notable in this context are the changes to the productions defining which characters are allowed, which are allowed to start or to be in a name, and which are considered newline characters.
Unicode
Unicode Consortium,
The Unicode Standard, Version 4.0,
also available online <http://www.unicode.org/versions/Unicode4.0.0/>.
Read it online, buy it, pester your boss into getting a copy for the office, but do read it. Most of the material in this article is based on the first 5 chapters, and of course it’s to be trusted over me. Hopefully if this article has done anything it’s convinced people to tackle the whole Standard, or at least the earlier more “techie” chapters.
ISO 10646
International Organisation for Standardisation, ISO/IEC 10646-1:2000,
Information technology — Universal Multiple-Octet Coded Character Set (UCS) — Part 1: Architecture and Basic Multilingual Plane.
International Organisation for Standardisation, ISO/IEC 10646-1:2000/Amd 1:2002,
Mathematical symbols and other characters.
International Organisation for Standardisation, ISO/IEC 10646-2:2001,
Information technology — Universal Multiple-Octet Coded Character Set (UCS) — Part 2: Supplementary Planes.
UnicodeXML
Martin Dürst and Asmus Freytag,
Unicode in XML and other Markup Languages,
Unicode Technical Report #20 <http://www.unicode.org/reports/tr20/>,
W3C Note <http://www.w3.org/TR/unicode-xml/>.
rain forest puppy
rain forest puppy,
IIS %c1%1c remote command execution,
Reported to Bugtraq <http://seclists.org/lists/bugtraq/2000/Oct/0264.html> and elsewhere. If memory serves, rain forest puppy also had documentation of this hole on his own site, but it appears to have disappeared.
UAX 15
Mark Davis, Martin Dürst, Unicode Normalization Forms <http://www.unicode.org/reports/tr15/>.
Charmod
Martin J. Dürst, François Yergeau, Richard Ishida, Misha Wolf, Tex Texin,
Character Model for the World Wide Web 1.0 <http://www.w3.org/TR/charmod/>.
This is currently a Working Draft, but is still very definitely worth reading.
RFC 2396
T. Berners-Lee, R. Fielding, L. Masinter,
Uniform Resource Identifiers (URI): Generic Syntax <http://www.ietf.org/rfc/rfc2396.txt>.
IURI
Larry Masinter, Martin Dürst,
Internationalized Uniform Resource Identifiers (IURI) <http://www.w3.org/International/2000/03/draft-masinter-url-i18n-05.txt>
(expired Internet-Draft).
Steve DeRose, Eve Maler, David Orchard,
XML Linking Language (XLink) Version 1.0 <http://www.w3.org/TR/xlink/>,
(in particular see §5.4’s description of Internationalised URIs).
IANA Char Reg
IANA,
Character Sets <http://www.iana.org/assignments/character-sets>.
This is a list of character encodings that have been registered with IANA.
C14N
John Boyer,
Canonical XML Version 1.0 <http://www.w3.org/TR/xml-c14n>.
Schema
Henry S. Thompson, David Beech, Murray Maloney, Noah Mendelsohn,
XML Schema Part 1: Structures <http://www.w3.org/TR/xmlschema-1/>.
Paul V. Biron, Ashok Malhotra,
XML Schema Part 2: Datatypes <http://www.w3.org/TR/xmlschema-2/>.

The following were used as references when stating what was out of scope for this article. As such they don’t touch much on the main topic here, but they are worth looking at, either as solutions in themselves for combining XML with user input and rendering, or to learn how they cope with that matter.

XHTML
Steven Pemberton, Daniel Austin, Jonny Axelsson, Tantek Çelik, Doug Dominiak, Herman Elenbaas, Beth Epperson, Masayasu Ishikawa, Shin’ichi Matsui, Shane McCarron, Ann Navarro, Subramanian Peruvemba, Rob Relyea, Sebastian Schnitzenbaumer, Peter Stark,
XHTML™ 1.0 The Extensible HyperText Markup Language (Second Edition) <http://www.w3.org/TR/xhtml1/>.
Mark Baker, Masayasu Ishikawa, Shinichi Matsui, Peter Stark, Ted Wugofski, Toshihiko Yamakami,
XHTML™ Basic <http://www.w3.org/TR/xhtml-basic/>.
Murray Altheim, Shane McCarron,
XHTML™ 1.1 - Module-based XHTML <http://www.w3.org/TR/xhtml11/>.
Jonny Axelsson, Beth Epperson, Masayasu Ishikawa, Shane McCarron, Ann Navarro, Steven Pemberton,
XHTML™ 2.0 <http://www.w3.org/TR/xhtml2/>,
Currently a Working Draft.
Dave Raggett, Arnaud Le Hors, Ian Jacobs,
HTML 4.01 Specification <http://www.w3.org/TR/html401/>,
Not a version of XHTML, but the XHTML™ 1.0 spec builds on this and assumes knowledge of it.
Richard Ishida,
Authoring Techniques for XHTML & HTML Internationalization 1.0 <http://www.w3.org/TR/i18n-html-tech/>.
Quite an early Working Draft, but full of information on XHTML Internationalisation issues.
MathML
David Carlisle, Patrick Ion, Robert Miner, Nico Poppelier, Ron Ausbrooks, Stephen Buswell, David Carlisle, Stéphane Dalmas, Stan Devitt, Angel Diaz, Max Froumentin, Roger Hunter, Patrick Ion, Michael Kohlhase, Robert Miner, Nico Poppelier, Bruce Smith, Neil Soiffer, Robert Sutor, Stephen Watt
Mathematical Markup Language (MathML) Version 2.0 (Second Edition) <http://www.w3.org/TR/MathML2/>.
SVG
Jon Ferraiolo, 藤沢 淳, Dean Jackson,
Scalable Vector Graphics (SVG) 1.1 Specification <http://www.w3.org/TR/SVG11/>.
XForms
Micah Dubinko, Leigh L. Klotz, Jr., Roland Merrick, T. V. Raman,
XForms 1.0 <http://www.w3.org/TR/xforms/>.