UTF-8 encoding: why do the following bytes have the prefix 10?
09:38 26 Oct 2018

As far as I know, UTF-8 is a variable-length encoding, i.e. a character can be represented as 1, 2, 3 or 4 bytes.

For example the Unicode character U+00A9 = 10101001 is encoded in UTF-8 as

11000010 10101001, i.e. 0xC2 0xA9
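
To double-check this example, here is a minimal Python sketch that encodes U+00A9 and prints each resulting byte in binary and hex. It relies only on the built-in `str.encode`; the variable names are just illustrative.

```python
# Encode the code point U+00A9 as UTF-8 and inspect the bytes.
code_point = 0x00A9
encoded = chr(code_point).encode("utf-8")

# Show every byte as 8 binary digits and as hex.
bits = [format(b, "08b") for b in encoded]
hexes = [format(b, "#04x") for b in encoded]

print(bits)   # ['11000010', '10101001']
print(hexes)  # ['0xc2', '0xa9']
```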

The prefix 110 in the first byte indicates that the character is stored in two bytes (I count two ones before the first zero in the prefix 110).

Each of the following bytes starts with the prefix 10.

A 4-byte UTF-8 encoding would look like

11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

The prefix 11110 (four ones, then a zero) indicates four bytes, and so on.
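
The "count the ones until the first zero" rule can be sketched in Python as follows. `sequence_length` is a hypothetical helper name, not a library function, and the sketch only looks at the prefix bits; it does not check the other validity rules of UTF-8 (for example, overlong or out-of-range first bytes).

```python
def sequence_length(first_byte: int) -> int:
    """Return the sequence length signalled by the prefix of a first byte."""
    if (first_byte & 0b10000000) == 0:
        return 1  # 0xxxxxxx: a single-byte character

    # Count the leading ones before the first zero bit.
    ones = 0
    mask = 0b10000000
    while first_byte & mask:
        ones += 1
        mask >>= 1

    # A single leading one (10xxxxxx) marks a following byte, not a first byte,
    # and more than four leading ones is not used for 1 to 4 byte sequences.
    if ones == 1 or ones > 4:
        raise ValueError("not a valid first byte")
    return ones


print(sequence_length(0b11000010))  # 2
print(sequence_length(0b11110000))  # 4
```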

Now my question:

Why is the prefix 10 used in the following bytes? What is the advantage of such a prefix? Without the 10 prefix in the following bytes, I could use 3*2 = 6 more bits if I wrote:

11110xxx xxxxxxxx xxxxxxxx xxxxxxxx

unicode encoding utf-8 character