Archive 1

Typo in "Supplementary Special-purpose Plane" ??

The section "Supplementary Special-purpose Plane" includes the line:

Variation Selectors Supplement (0E0100–E01EF)

That zero in front of the first hex number sure looks wrong to me, but I honestly don't know enough about this topic to know if it serves some actual purpose. Would someone better informed please fix it if it's wrong, or say why it's right?

[1] 2010 edit. -DePiep (talk) 13:10, 7 October 2021 (UTC)

Old Hungarian

Turoslangos is playing games here. Neither UTC nor WG2 will accept Old Hungarian into the BMP. There isn't room, and neither is there justification for encoding it there. -- Evertype· 21:04, 7 November 2008 (UTC)

Private Use Area planes for social networks

I've been finding HTML documents with glyphs for Facebook, Twitter, etc. as Unicode characters in the Private Use Area planes. This requires a custom font. Any references on this? --John Nagle (talk) 20:46, 30 April 2013 (UTC)

By definition, anyone can publish or use a character definition in PUA space (example: I may mail my spouse a PUA character meaning X, and only the two of us know that meaning. We don't even need the font; the code point number alone is enough for us). If FB or TWI does so, it is up to them to provide the font and to make it work publicly. If they can't get that right, the reader will see the wrong character. Like in the old days: question marks at best.
Actually, is that so? Are there examples by FB or TWI themselves? It could be that users/companies are using PUAs (when writing on FB or TWI), but then the issue is with those users. -DePiep (talk) 21:08, 30 April 2013 (UTC)
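For illustration, a minimal Python sketch of the mechanics described above. The code point U+F0001 (Supplementary Private Use Area-A, Plane 15) is an arbitrary example of my own, not one actually published by FB or TWI:

# A hypothetical privately-agreed character from Plane 15
# (Supplementary Private Use Area-A, U+F0000-U+FFFFD). Its meaning
# exists only by private agreement; without a custom font supplying
# a glyph, readers see a replacement or question-mark glyph.
pua_char = "\U000F0001"

# The code point itself travels fine through any Unicode encoding:
print(hex(ord(pua_char)))            # 0xf0001
print(pua_char.encode("utf-8"))      # b'\xf3\xb0\x80\x81'
print(pua_char.encode("utf-16-be"))  # b'\xdb\x80\xdc\x01' (a surrogate pair)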

UTF-8 "designed for 2^21 bits"

The UTF-8 coding scheme was designed when Unicode was still contemplating a 31-bit space. It was not "designed" for a limit of 2^21 codepoints, and was eventually restricted to a much smaller number anyway (0x10FFFF). Elphion (talk) 01:13, 3 October 2016 (UTC)
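To illustrate that history, here is a small Python sketch of my own, with byte-length thresholds taken from the original 1993 design and RFC 3629:

def utf8_len_original(cp):
    """Byte length of a code point under the original UTF-8 design,
    which covered a full 31-bit space with up to six bytes."""
    if cp < 0x80:        return 1
    if cp < 0x800:       return 2
    if cp < 0x10000:     return 3
    if cp < 0x200000:    return 4   # up to 21 bits
    if cp < 0x4000000:   return 5   # up to 26 bits
    if cp < 0x80000000:  return 6   # up to 31 bits
    raise ValueError("beyond the 31-bit design limit")

# RFC 3629 (2003) forbade the 5- and 6-byte forms, capping UTF-8 at
# U+10FFFF, the limit imposed by UTF-16's surrogate mechanism.
print(utf8_len_original(0x10FFFF))    # 4: still valid today
print(utf8_len_original(0x7FFFFFFF))  # 6: legal in the 1993 design, forbidden now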

Why would Unicode modernize a code space by making it smaller? 108.71.123.25 (talk) 16:05, 5 October 2016 (UTC)
Because otherwise the parties could not agree on a standard. Too many manufacturers were already heavily invested in 16-bit characters. UTF-16 was the compromise that allowed the standard to go forward. When eventually we run out of space (and we will, though computing technology will have changed a lot by the time that happens), larger spaces will be introduced. But they will not be "Unicode". -- Elphion (talk) 16:18, 5 October 2016 (UTC)
But 0x00E00000 to 0x00FFFFFF and 0x60000000 to 0x7FFFFFFF were assigned! And my flip phone runs an operating system that uses a 32-bit code space. 108.71.123.25 (talk) 16:21, 5 October 2016 (UTC)
(see below -- Elphion (talk) 16:23, 5 October 2016 (UTC))
When I enter text on my flip phone, a character map is shown with the code point above it. Highlighting the space displays 0x00000020 at the top. This implies that it uses a 32-bit space. 108.71.123.25 (talk) 16:27, 5 October 2016 (UTC)

Plane 16 and "20-bit limit"

Obviously, Plane 16 (100000-10FFFF) is a 21-bit entity (why they crashed through to Plane 16 with Planes 3-13 unused seems rather inelegant here, but I'm not a Unicode expert). I can, however, decipher hexadecimal. I have no idea how to "improve" ("correct"?) this, but it needs to be done. Grndrush (talk) 17:18, 3 January 2009 (UTC)

I was about to say much the same. Is the answer to call it a 17-plane limit and ignore the bit-question? Alternatively one could explain that the 20-bit limit is a matter of the address space defined by the available surrogate pairs, and thus defines the number of planes available beyond the BMP. (If I have understood aright…) Ian Spackman (talk) 00:11, 28 July 2009 (UTC)
21 bits is just an outcome; it is not a preset limit. Here we go. The BMP is defined over the full 16 bits (hhhh): 0000-FFFF, ~65,000 numbers. (Written with its plane prefix as 00hhhh, so Plane = 0.) In this plane, 1024 high surrogates and 1024 low surrogates are defined, at D800-DBFF and DC00-DFFF. Surrogates must be used in pairs (one high, one low) to point to a character, so they can identify exactly 1024×1024 ≈ 1M points. Together a pair, hhhh_high.hhhh_low, takes 32 bits, so the 1M points lie within the range D800.DC00-DBFF.DFFF (but not at every point in that range).
In comes UTF-16. UTF-16 maps these 32-bit pairs 1:1 onto the range 10000-10FFFF (hex), starting right after Plane 0 (at FFFF+1) and filled exactly by the ~1M points, creating Planes 1 to 16 (decimal; the final 10 in hex). Now there is no unused number any more, and the whole range can be identified with 21 bits.
So because 1024×1024 surrogate pairs are defined, the UTF-16-mapped numbers fit exactly within a 21-bit range. Starting a Plane 17 at 10FFFF+1 = 110000 would need a 22nd bit, and could not be mapped back to a high-low 32-bit pair.
Nowadays the U+hhhhhh notation is commonly used. -DePiep (talk) 17:13, 6 October 2010 (UTC)
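To make the arithmetic above concrete, here is a minimal Python sketch (my own illustration, not text from any specification) of how a surrogate pair maps to a supplementary code point:

def decode_surrogate_pair(high, low):
    """Combine a high and a low surrogate into one code point.

    high is in 0xD800-0xDBFF and low is in 0xDC00-0xDFFF, each carrying
    10 payload bits. The 1024 x 1024 = 1,048,576 combinations cover
    U+10000-U+10FFFF exactly, i.e. Planes 1-16, which is where the
    overall 21-bit / 17-plane limit comes from.
    """
    assert 0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

print(hex(decode_surrogate_pair(0xD800, 0xDC00)))  # 0x10000, first point past the BMP
print(hex(decode_surrogate_pair(0xDBFF, 0xDFFF)))  # 0x10ffff, top of Plane 16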
0xHHHHHHHH 108.71.120.43 (talk) 20:50, 10 October 2016 (UTC)