UTF-8 Encoding


Summary

UTF-8 is a compromise character encoding that can be as compact as ASCII (if the file is just plain English text) but can also contain any Unicode character (with some increase in file size).

UTF stands for Unicode Transformation Format. The '8' means it uses 8-bit blocks to represent a character. The number of blocks needed to represent a character varies from 1 to 4.
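As a quick illustration of the 1-to-4 byte range, the following Python sketch counts the encoded bytes for a few sample characters. The sample characters and the helper name utf8_length are chosen here purely for illustration; they do not come from any particular library.

```python
# Sample characters (an arbitrary choice for illustration):
# U+0041 'A', U+00E9 (e with acute), U+20AC (euro sign), U+1F600 (an emoji).
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]


def utf8_length(ch: str) -> int:
    """Return the number of bytes in the UTF-8 encoding of one character."""
    return len(ch.encode("utf-8"))


for ch in samples:
    print(f"U+{ord(ch):04X} takes {utf8_length(ch)} byte(s)")
```

The four samples need 1, 2, 3 and 4 bytes respectively.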

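One of the really nice features of UTF-8 is that it is compatible with nul-terminated strings. No character other than nul itself (U+0000) will have a nul (0) byte when encoded. This means that C code that treats char[] as nul-terminated data will "just work", although functions such as strlen will count bytes rather than characters.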
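The no-nul-byte property can be checked by brute force. The sketch below (in Python rather than C, for brevity) encodes every code point except U+0000 and looks for a zero byte. The helper names are hypothetical, and the surrogate range U+D800 to U+DFFF is skipped because those values cannot be encoded as UTF-8.

```python
def has_nul_byte(text: str) -> bool:
    """Return True if the UTF-8 encoding of text contains a zero byte."""
    return 0 in text.encode("utf-8")


def all_nonzero_code_points():
    """Yield every encodable code point from U+0001 to U+10FFFF."""
    for cp in range(1, 0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates are not encodable in UTF-8
        yield cp


# Collect any code point whose encoding contains a zero byte.
offenders = [cp for cp in all_nonzero_code_points() if has_nul_byte(chr(cp))]
print(len(offenders))
```

The list of offenders comes out empty; only U+0000 itself encodes to a zero byte.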

You can try the UTF-8 Test Page to see how well your browser (and default font) support UTF-8.

If you are an application developer, this Joel On Software article on Unicode is a pretty good summary of all you need to know.


Detail

For any character equal to or below 127 (hex 0x7F), the UTF-8 representation is one byte. It is just the lowest 7 bits of the full Unicode value. This is also the same as the ASCII value.

For characters above 127 and equal to or below 2047 (hex 0x07FF), the UTF-8 representation is spread across two bytes. The first byte will have the two high bits set and the third bit clear (i.e. 0xC2 to 0xDF). The second byte will have the top bit set and the second bit clear (i.e. 0x80 to 0xBF).

For all characters from 2048 (hex 0x0800) up to and including 65535 (hex 0xFFFF), the UTF-8 representation is spread across three bytes. Characters from 65536 (hex 0x10000) up to the Unicode maximum of 1,114,111 (hex 0x10FFFF) are spread across four bytes.

The following table shows the format of such UTF-8 byte sequences (where the "free bits" shown by x's in the table are combined in the order shown, and interpreted from most significant to least significant).

Binary format of bytes in sequence

1st Byte  2nd Byte  3rd Byte  4th Byte  Number of Free Bits  Maximum Expressible Unicode Value
0xxxxxxx                                7                    007F hex (127)
110xxxxx  10xxxxxx                      (5+6)=11             07FF hex (2047)
1110xxxx  10xxxxxx  10xxxxxx            (4+6+6)=16           FFFF hex (65535)
11110xxx  10xxxxxx  10xxxxxx  10xxxxxx  (3+6+6+6)=21         10FFFF hex (1,114,111)
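The bit layouts in the table translate fairly directly into shifts and masks. The following is a minimal sketch of an encoder for a single code point, not a production implementation; the function name encode_code_point is hypothetical, and the rejection of surrogates (U+D800 to U+DFFF) is an assumption added here that the table itself does not mention.

```python
def encode_code_point(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8 using the bit layouts above."""
    if cp < 0 or cp > 0x10FFFF:
        raise ValueError("code point out of range")
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("surrogates cannot be encoded")
    if cp <= 0x7F:
        # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:
        # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:
        # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


print(encode_code_point(0x20AC).hex(" "))  # euro sign
```

Each continuation byte carries six of the free bits, so the code point is consumed six bits at a time from the least significant end, with whatever is left over going into the first byte.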

The value of each individual byte indicates its UTF-8 function, as follows:

  • 00 to 7F hex (0 to 127): first and only byte of a sequence.
  • 80 to BF hex (128 to 191): continuing byte in a multi-byte sequence.
  • C2 to DF hex (194 to 223): first byte of a two-byte sequence.
  • E0 to EF hex (224 to 239): first byte of a three-byte sequence.
  • F0 to F4 hex (240 to 244): first byte of a four-byte sequence.
  • C0, C1 and F5 to FF hex (192, 193 and 245 to 255): never appear in valid UTF-8.
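Because each byte value announces its own role, a program can find character boundaries without decoding anything. The sketch below classifies a byte by value and uses that to count characters. The function names and label strings are hypothetical. It treats C0, C1 and F5 to FF as invalid, on the grounds that two-byte sequences start at C2 and the 10FFFF limit caps the four-byte lead byte at F4.

```python
def classify_byte(b: int) -> str:
    """Return the UTF-8 role of a single byte value (0 to 255)."""
    if not 0 <= b <= 0xFF:
        raise ValueError("not a byte value")
    if b <= 0x7F:
        return "single"        # first and only byte
    if b <= 0xBF:
        return "continuation"  # 10xxxxxx
    if b <= 0xC1:
        return "invalid"       # C0 and C1 would only produce overlong forms
    if b <= 0xDF:
        return "lead2"         # 110xxxxx
    if b <= 0xEF:
        return "lead3"         # 1110xxxx
    if b <= 0xF4:
        return "lead4"         # 11110xxx, limited by the 10FFFF maximum
    return "invalid"           # F5 to FF


def count_characters(data: bytes) -> int:
    """Count characters in well-formed UTF-8 by skipping continuation bytes."""
    return sum(1 for b in data if classify_byte(b) != "continuation")


print(count_characters("h\u00e9llo".encode("utf-8")))
```

The sample string has five characters but six bytes, since the accented letter takes two bytes.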

UTF-8 remains a simple, single-byte, ASCII-compatible encoding method, as long as no characters greater than 127 are directly present. This means that an HTML document declared to be encoded as UTF-8 can remain a normal single-byte ASCII file. The document can remain so even though it may contain Unicode characters above 127, as long as all characters above 127 are referred to indirectly by ampersand entities (character references such as &#233;).
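One way to produce such an all-ASCII document is to replace every character above 127 with a numeric character reference. The sketch below uses Python's standard "xmlcharrefreplace" error handler and html.unescape; the sample text is an arbitrary choice for illustration.

```python
import html

# Sample text containing two characters above 127 (U+00E9 and U+20AC).
text = "caf\u00e9 \u20ac5"

# Encode to pure ASCII, replacing anything above 127 with &#NNN; references.
ascii_html = text.encode("ascii", errors="xmlcharrefreplace")
print(ascii_html)  # b'caf&#233; &#8364;5'

# The result is plain ASCII, so it is also valid UTF-8 as it stands.
as_utf8 = ascii_html.decode("utf-8")

# Resolving the references recovers the original characters.
restored = html.unescape(as_utf8)
print(restored == text)  # True
```

The trade-off is size: each reference costs several ASCII bytes where direct UTF-8 would use two or three.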

Examples of encoded Unicode characters (in hexadecimal notation)

Unicode Value (hex)  UTF-8 Sequence (hex)
0001                 01
007F                 7F
0080                 C2 80
07FF                 DF BF
0800                 E0 A0 80
FFFF                 EF BF BF
010000               F0 90 80 80
10FFFF               F4 8F BF BF
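The examples can be checked by decoding each sequence back to its Unicode value. The following is a minimal decoder sketch for a single sequence; the function name decode_sequence is hypothetical. As a simplification it validates the lead byte, the length and the continuation bytes, but it does not reject overlong forms or surrogates, which a strict decoder would.

```python
def decode_sequence(seq: bytes) -> int:
    """Decode one UTF-8 byte sequence to its Unicode value (simplified)."""
    if not seq:
        raise ValueError("empty sequence")
    first = seq[0]
    if first <= 0x7F:
        length, value = 1, first
    elif 0xC2 <= first <= 0xDF:
        length, value = 2, first & 0x1F   # 5 free bits
    elif 0xE0 <= first <= 0xEF:
        length, value = 3, first & 0x0F   # 4 free bits
    elif 0xF0 <= first <= 0xF4:
        length, value = 4, first & 0x07   # 3 free bits
    else:
        raise ValueError("invalid first byte")
    if len(seq) != length:
        raise ValueError("wrong sequence length")
    for b in seq[1:]:
        if b & 0xC0 != 0x80:
            raise ValueError("invalid continuation byte")
        value = (value << 6) | (b & 0x3F)  # append 6 free bits
    return value


# The examples from the table: Unicode value -> UTF-8 sequence in hex.
examples = {
    0x0001: "01",
    0x007F: "7F",
    0x0080: "C2 80",
    0x07FF: "DF BF",
    0x0800: "E0 A0 80",
    0xFFFF: "EF BF BF",
    0x010000: "F0 90 80 80",
    0x10FFFF: "F4 8F BF BF",
}

for value, hex_bytes in examples.items():
    decoded = decode_sequence(bytes.fromhex(hex_bytes))
    print(f"{hex_bytes:<12} -> {decoded:06X}", decoded == value)
```

Every row decodes back to the value in its left-hand column.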