Talk:Plane (Unicode)

0x00E00000 to 0x00FFFFFF/0x60000000 to 0x7FFFFFFF

Some operating systems still have these as private use areas. 108.71.123.25 (talk) 16:07, 5 October 2016 (UTC)

But those are not Unicode planes, the subject of this article. The Unicode standard sets a maximum of 17 planes. There is nothing to stop people from storing other values in 32 bits, but that's not Unicode. -- Elphion (talk) 16:13, 5 October 2016 (UTC)
Universal Character Set still has this. Some operating systems still have these. My flip phone has one such operating system that uses UTF-32/UCS-4, and it shows an 8 digit code point. 108.71.123.25 (talk) 16:17, 5 October 2016 (UTC)
No, UCS was revised to agree with Unicode, for consistency. Whatever your flip phone uses is not Unicode, and not UCS-4, no matter how it might be labeled. -- Elphion (talk) 16:21, 5 October 2016 (UTC)
When I can enter text, it displays 0x00000021 and highlights the space. This 8 digit code point means that it is a 32 bit code space. 108.71.123.25 (talk) 16:25, 5 October 2016 (UTC)
As I said, nothing prevents a programmer from storing arbitrary values in 32 bits. That doesn't make them Unicode, which has a very precise and well-documented definition that caps the space at U+10FFFF. The number of leading zeroes shown in the display doesn't alter that. Added:  If in fact your phone uses values above U+10FFFF, it was programmed to use a non-standard extension of Unicode, which (since Unicode is capped) is reasonably safe, in the sense that those private characters will never be assigned conflicting Unicode values. But the programmer would have no expectation that the non-standard values would be understood beyond the phone's universe. Such a message sent to another phone from a different manufacturer (or a different revision level) likely won't display as intended. -- Elphion (talk) 16:57, 5 October 2016 (UTC)
I scrolled through the characters. The map starts at 0x00000020 and ends at 0x0002FA1D. 108.71.123.25 (talk) 17:41, 5 October 2016 (UTC)
Regardless of what encoding scheme your phone uses, your changes will be reverted because they contradict the actual Unicode Standard, and that's what this article is about. See section 2.4 of the Standard:

In the Unicode Standard, the codespace consists of the integers from 0 to 10FFFF, comprising 1,114,112 code points available for assigning the repertoire of abstract characters.

Anything outside of that codespace isn't Unicode and isn't relevant to this article. DRMcCreedy (talk) 18:02, 5 October 2016 (UTC)
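For anyone who wants to check the quoted figure, here is a minimal sketch of the arithmetic. The names PLANE_SIZE, PLANE_COUNT, MAX_CODE_POINT and in_codespace are illustrative only and are not taken from the Standard.

```python
# Illustrative constants (names are assumptions, not from the Standard).
PLANE_SIZE = 0x10000       # 65,536 code points per plane
PLANE_COUNT = 17           # planes 0 through 16
MAX_CODE_POINT = 0x10FFFF  # last code point of plane 16


def in_codespace(value: int) -> bool:
    """Return True if value lies inside the Unicode codespace 0..10FFFF."""
    return 0 <= value <= MAX_CODE_POINT


codespace_size = PLANE_COUNT * PLANE_SIZE
print(codespace_size)            # 1114112, the figure quoted from section 2.4
print(in_codespace(0x0002FA1D))  # True: the last entry in the phone's map
print(in_codespace(0x60000000))  # False: a 32-bit value, but not a code point
```

The width of the stored integer does not enter into the check at all; only the value does.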
In the "Help" display for entering characters, it says "...to select the UTF-32/UCS-4 character..." 108.66.233.59 (talk) 18:04, 5 October 2016 (UTC)
And, according to your own experiment, it does not go beyond the Unicode space: it stops at U+2FA1D, which is well below U+10FFFF. So although your phone is using a 32-bit display (or a 31-bit display, it's hard to tell when the highest digit is 0), it is only dealing in characters within the Unicode space. -- Elphion (talk) 18:21, 5 October 2016 (UTC)
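To make the point about leading zeroes concrete, the plane of a code point is simply its value divided by 65,536, however many digits the display pads it to. A rough sketch, where plane_of is a hypothetical helper written for this comment:

```python
def plane_of(code_point: int) -> int:
    """Return the plane number (0..16) that a code point belongs to."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError(f"0x{code_point:08X} is outside the Unicode codespace")
    return code_point >> 16  # each plane holds 0x10000 code points


# Values taken from the discussion above, written with 8 hex digits as on the phone.
for value in (0x00000021, 0x0002FA1D, 0x0010FFFF):
    print(f"U+{value:04X} is in plane {plane_of(value)}")
# U+0021 is in plane 0
# U+2FA1D is in plane 2
# U+10FFFF is in plane 16
```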
It's a 32 bit display. 108.66.233.59 (talk) 18:34, 5 October 2016 (UTC)
Also, if I click this link on my flip phone, and I enter a number higher than 0x0010FFFF, it displays a box with the code point in it. For example, if I enter 0x60000000, it displays this:
+----+
|6000|
|0000|
+----+

108.71.120.43 (talk) 22:29, 10 October 2016 (UTC)

And that shows nothing, except that your cellphone and the internet app do not screen out non-standard input. Neither your cellphone nor the app at unicodelookup.com constitutes a WP:RS. As we have all been telling you, the standard is quite clear: there are no valid code points above U+10FFFF. -- Elphion (talk) 22:54, 10 October 2016 (UTC)

If I click the same link on my Windows computer or Android phone, the site says "undefined", which is what it is. In my assessment there is too little unused space in the BMP to extend UTF-16 to 6 bytes. Otherwise a new type of high surrogate could be allocated as the first of three 16-bit words.--BIL (talk) 14:59, 8 January 2017 (UTC)
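As background to the UTF-16 remark, here is a hedged sketch of how the existing two-word surrogate mechanism covers exactly planes 1 to 16, and nothing beyond them. to_surrogate_pair is a hypothetical helper, not a library function.

```python
def to_surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a supplementary code point into UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points in planes 1..16 use surrogate pairs")
    offset = code_point - 0x10000      # 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits, range D800..DBFF
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits, range DC00..DFFF
    return high, low


# 1,024 high surrogates times 1,024 low surrogates = 16 planes of 65,536.
print(1024 * 1024 == 16 * 0x10000)                       # True
print([hex(u) for u in to_surrogate_pair(0x2FA1D)])      # ['0xd87e', '0xde1d']
print([hex(u) for u in to_surrogate_pair(0x10FFFF)])     # ['0xdbff', '0xdfff']
```

The highest pair (DBFF, DFFF) lands on U+10FFFF, which is why the cap sits where it does: going further would need a new kind of code unit, as suggested above.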

Math error, request confirmation/correction by Unicode standards expert.

When I tally the total number of Code Points available from the three Private Use ranges I get four (4) more than is indicated by the summary at the top of this article.

- par.3: .... 137,468 are reserved for private use, leaving 974,530 for public assignment.
- par.4: .... 65,536 code points (Supplementary Private Use Area-A and -B, which constitute the entirety of planes 15 and 16).
Basic Multilingual Plane:
- par.4: As of Unicode 12.1, the BMP comprises the following 163 blocks:
 o ....
 o Private Use Area (E000–F8FF)
 F8FFhex	63743  end of BMP Private Use Block
-DFFFhex	57343  end of preceding Surrogate Block
=============
 1900hex 6400 code points in BMP Private Use Block
   6,400 Private Use Block in Unicode Plane 0 (BMP)
+ 65,536 Private Use Block in Unicode Plane 15 (PUA-A)
+ 65,536 Private Use Block in Unicode Plane 16 (PUA-B)
========
 137,472 tally of the three (3) Private Use Blocks
 137,472 tally of the three (3) Private Use Blocks
-137,468 code points referenced in introduction to this article
========
       4 fewer code points in Intro than calculated from tallies of the 3 Blocks

Tree4rest (talk) 23:46, 24 September 2019 (UTC)

Is this possibly caused by the xxFFFE and xxFFFF code points in the PUA planes? Spitzak (talk) 23:54, 24 September 2019 (UTC)
Yes. Although each plane has 65,536 = 2^16 code points, the last two in each plane are permanently declared non-characters. So only 65,534 are available for (any) use in planes 15 and 16. -- Elphion (talk) 00:35, 25 September 2019 (UTC)
Actually, even if the last two codepoints in each plane are declared "non-characters", they are valid codepoints and can be encoded, say with UTF-8, even if the encoded text is non-conforming. The same is true for the few non-characters assigned inside the Arabic forms near the end of the BMP. Being "non-characters" means that they are not useful for encoding text for interchange, but they can still be used *locally* as special-purpose marks inside applications, libraries, or renderers, to facilitate their implementation. They are used for that: on input, texts are filtered, and non-characters may be filtered out, or the whole document may be rejected as invalid, or they may be replaced by a placeholder; internally, however, they can be used freely by the implementation, which should still not emit transformed texts containing them, because such texts would be rejected by the recipient.
Those non-characters then have NO meaning (like PUA) but are more restricted than PUA, because their interchange in conforming text documents is invalid (for example, the non-characters must NOT be present in documents conforming to standards like HTML, XML or JSON). Various applications or libraries will reject them if they ever detect them: for example, a filesystem API may detect an encoding error and filesystem inconsistency, or desynchronization problems, or data corruption in the media, and could refuse to mount such a filesystem or to grant any write access without specific permission. Special maintenance will then be needed, which cannot be automated, as it could cause security issues or corrupt important data that is not supposed to be text and could be an encrypted binary file: "repairing" the filesystem by replacing or dropping those characters could damage the data or invalidate its binary signature.
Non-characters are very useful, as they can be used to detect corruption, access violations, or failures in communication or storage protocols. They can then be used as guards (notably the last two codepoints at the end of each plane), for example to create binary container formats multiplexing text parts and binary parts, all of variable length (e.g. inside encoded media streams in audio/video/image formats, including JPEG, MPEG, PNG, WebM, Ogg and others, where text fragments may be present for tagging metadata, subtitles, titles, or licensing and copyright statements, or to embed URIs or HTML, XML and JSON documents). There are not many non-characters, but they are still valid codepoints, meaning that they can be transformed bijectively between all conforming UTFs. That is not the case for surrogates, which lack this bijective capability, so there is no roundtrip conversion for them (the roundtrip does not work with two successive surrogates, it only works with isolated surrogates, which are still forbidden in conforming texts). Surrogates do not have any value even though they have a codepoint assigned to them; they exist only to implement UTF-16. If UTF-16 were not part of the standard, there would be NO surrogates at all in the BMP, but there would still be non-characters. verdy_p (talk) 02:59, 17 October 2020 (UTC)
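The difference between non-characters and surrogates described above can be observed with a small sketch. This assumes CPython's strict "utf-8" codec, which is one implementation's behaviour and not a statement of the Standard; is_noncharacter, is_surrogate and utf8_round_trips are hypothetical helper names.

```python
def is_noncharacter(cp: int) -> bool:
    """True for U+FDD0..U+FDEF and the last two code points of each plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp <= 0x10FFFF and (cp & 0xFFFE) == 0xFFFE)


def is_surrogate(cp: int) -> bool:
    """True for the UTF-16 surrogate range U+D800..U+DFFF."""
    return 0xD800 <= cp <= 0xDFFF


def utf8_round_trips(cp: int) -> bool:
    """True if the code point survives encode and decode with the strict codec."""
    try:
        return chr(cp).encode("utf-8").decode("utf-8") == chr(cp)
    except UnicodeError:
        return False


print(sum(is_noncharacter(cp) for cp in range(0x110000)))  # 66
print(utf8_round_trips(0xFFFF))    # True: a non-character still encodes
print(utf8_round_trips(0x10FFFE))  # True: likewise at the end of plane 16
print(utf8_round_trips(0xD800))    # False: a lone surrogate is rejected
```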
The article count is correct. The two non-characters at the end of each plane are specifically excluded from PUA-A and PUA-B per The Unicode Standard (https://www.unicode.org/versions/Unicode13.0.0/ch23.pdf#G19378). DRMcCreedy (talk) 03:43, 17 October 2020 (UTC)
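For the record, the tally from the original question can be redone with the two non-characters per plane excluded. This is only a sketch of the arithmetic; the variable names are made up for illustration, and the count of 66 non-characters (two per plane across 17 planes, plus the 32 at U+FDD0..U+FDEF) is stated here as background rather than taken from this thread.

```python
PLANE_SIZE = 0x10000                    # 65,536 code points per plane

bmp_pua = 0xF8FF - 0xE000 + 1           # 6,400 (E000..F8FF)
supplementary_pua = PLANE_SIZE - 2      # 65,534: xxFFFE and xxFFFF excluded
private_use = bmp_pua + 2 * supplementary_pua
print(private_use)                      # 137468, matching the article intro

surrogates = 0xDFFF - 0xD800 + 1        # 2,048
noncharacters = 17 * 2 + 32             # 66
public = 17 * PLANE_SIZE - surrogates - private_use - noncharacters
print(public)                           # 974530, also matching the intro
```

The difference of 4 in the original tally is therefore the two non-characters at the end of each of planes 15 and 16.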

Aramaic Scripts

Wouldn't the group heading "Aramaic Scripts" be more accurately named "Semitic Scripts" (or more pedantically, "Scripts used with Semitic languages")? Is the heading "Aramaic Scripts" an official designation made by a committee? 2601:602:8580:5E00:A8FA:C0F1:EB3C:C8C6 (talk) 07:56, 11 January 2021 (UTC)

It's not a Unicode designation. If you look at The Unicode Standard (http://www.unicode.org/versions/Unicode13.0.0/) you'll see that they're grouped geographically. Hebrew, Arabic, Syriac, Samaritan, and Mandaic are in Chapter 9: Middle East-I, Modern and Liturgical Scripts. Thaana is in Chapter 13: South and Central Asia-II. And N'Ko is in Chapter 19: Africa. I'm not weighing in on the Aramaic vs Semitic question, just that it's not a Unicode designation. DRMcCreedy (talk) 18:25, 11 January 2021 (UTC)