Talk:UTF-8


Table should not only use color to encode information (but formatting like bold and underline)

As noted in a previous comment (https://en.wikipedia.org/wiki/Talk:UTF-8/Archive_1#Colour_in_example_table?), this has been done before, and it is *better*, so that everyone can clearly see the different parts of the code. Relying on color alone is not good, because of color vision deficiencies and varying color rendition across devices.

How is utf8mb3 exactly the same as CESU-8?

Spitzak, you've repeatedly asserted that MySQL UTF8mb3 and CESU-8 are exactly the same in the edit comments. I believe you, but I can't follow you, because the source materials seem to say otherwise, and the citations seem insufficient.

In Unicode Technical Report #26, CESU-8 is explicitly defined to support supplemental characters: "In CESU-8, supplementary characters are represented as six-byte sequences". Whereas the MySQL 8.0 Reference Manual explicitly states that supplemental characters are not supported: "Supports BMP characters only (no support for supplementary characters)". And the MySQL 3.23, 4.0, 4.1 Reference Manual (when utf8mb3 first appears, as "utf8") says the same: "The ucs2 and utf8 character sets do not support supplementary characters that lie outside the BMP."

How do you reconcile these conflicting definitions of CESU-8 and utf8mb3? Is one of them wrong, or do they require further interpretation? If so, is that cited somewhere? I checked the citations, but I'm not seeing how they back up what you're saying -- they only seem to note that utf8mb3 doesn't support supplemental characters. If what you're saying is in fact true, I think further explication is needed beyond saying it is so, because the MySQL docs and UTR#26 seem to suggest that utf8mb3 and CESU-8 are definitionally different, at least when perused by a non-expert like myself trying to learn about the subject.

While I think the introductory paragraph is trying to shed some light, "many programs" is vague and not cited, and nor is it cited that MySQL is definitively one of those many programs, and nor is it cited that MySQL "transforms UCS-2 codes to three bytes or fewer" for utf8mb3. Does it? How do we know?

If what you're trying to say is that when UTF-16 supplemental characters are converted to UTF-8 as though they are UCS-2 (and not UTF-16), the result is what came to be called CESU-8, then I think you also need to say that while utf8mb3 is not intended to support supplemental characters at all, it functionally operates as CESU-8 if they are present. And ideally that should be backed up with a citation, or an example sufficient to demonstrate that this article is not the only place where one will find this assertion.

And, even if you're right that utf8mb3 and CESU-8 (and Oracle UTF8) are technically identical, it's still not correct to say that "MySQL calls [UTF-16 supplemental characters converted to UTF-8 as though they were UCS-2 characters] utf8mb3", because MySQL quite clearly defines utf8mb3 as being BMP-only; so MySQL is not "calling" anything involving supplemental characters utf8mb3.

Having now been trying to understand this for hours, I think this Oracle document explains it pretty well: "The UTF8 character set encodes characters in one, two, or three bytes...If supplementary characters are inserted into a UTF8 database...the supplementary characters are treated as two separate, user-defined characters that occupy 6 bytes." If what you're saying is correct (and I don't know that it is, because I don't have anything authoritative saying so), then it sounds like this could be equally applicable to utf8mb3. The article could make that clear, if properly cited or demonstrated.

TL;DR: It's not accurate to describe utf8mb3 as having any representation of supplemental characters, even if it can technically do so as described by CESU-8, because it is defined otherwise. Further, claiming utf8mb3 is technically identical to CESU-8 warrants citation or demonstration, and the claim would benefit from greater clarity. Ivanxqz (talk) 00:45, 15 September 2020 (UTC)

Both of them translate a UTF-16 supplemental pair into exactly the same 6 bytes, and unpaired surrogate halves into exactly the same 3 bytes, therefore they are identical. Spitzak (talk) 21:20, 15 September 2020 (UTC)
Can you cite this anywhere? No original research, etc. The only source for your information is you. (And you haven't responded to anything that I wrote above, not even the TLDR -- even if technically identical, which you have only asserted and not cited, MySQL does not "call" CESU-8 "utf8mb3" as you state -- utf8mb3 explicitly does not support supplemental characters, and therefore any handling of them in the style of CESU-8 is an accident, not a design.) Ivanxqz (talk) 04:55, 16 September 2020 (UTC)
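
For readers following this thread, the six-byte form that UTR #26 describes can be illustrated with a minimal Python sketch. The helper name cesu8_encode is hypothetical, and the sketch only shows what CESU-8 defines: each supplementary character is split into a UTF-16 surrogate pair, and each surrogate is then encoded as a three-byte sequence. It does not demonstrate what MySQL's utf8mb3 accepts or stores, which is the point in dispute above.

```python
def cesu8_encode(text: str) -> bytes:
    """Encode text as CESU-8 (illustrative sketch, hypothetical helper)."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp < 0x10000:
            # BMP code points are encoded exactly as in UTF-8.
            # "surrogatepass" lets a lone surrogate through as 3 bytes.
            out += ch.encode("utf-8", "surrogatepass")
        else:
            # Supplementary code point: split into a UTF-16 surrogate pair,
            # then encode each surrogate as its own 3-byte sequence.
            v = cp - 0x10000
            high = 0xD800 + (v >> 10)
            low = 0xDC00 + (v & 0x3FF)
            for surrogate in (high, low):
                out += chr(surrogate).encode("utf-8", "surrogatepass")
    return bytes(out)


ch = "\U00010400"
print(ch.encode("utf-8").hex(" "))   # standard UTF-8: 4 bytes
print(cesu8_encode(ch).hex(" "))     # CESU-8: 6 bytes
```

For U+10400 this gives f0 90 90 80 in standard UTF-8 and ed a0 81 ed b0 80 in CESU-8. BMP-only text comes out byte-for-byte identical in both.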

I decided to rewrite the CESU-8 section for what I think is greater clarity and accuracy. I included that CESU-8 in utf8mb3 is possible (though unsupported), on the basis of Spitzak's claim that it's the case. I noted that it needs a citation. I think it's not actually true, though, on the basis of Bernt's counter-demonstration at Talk:CESU-8#Comments, which I also just verified myself, and also the original references regarding utf8mb3 in the previous version, but I'll leave it for now. (Spitzak? Can you show somewhere why your claim that utf8mb3 can support supplemental characters via CESU-8 is accurate?)

I also gave utf8mb3 its own section again, since it is definitionally not CESU-8, even if technically it's the same thing (which, again, I don't think it is). It's like saying that Mountain Standard Time and Pacific Daylight Time are the same thing; they represent the exact same time of day in California and Arizona in the winter, but they're not the same thing, because they have different definitions. Ivanxqz (talk) 10:53, 16 September 2020 (UTC)

Adoption and non-adoption

https://softwareengineering.stackexchange.com/questions/102205/should-utf-16-be-considered-harmful

Under "Adoption": "Internally in software usage is even lower, with UCS-2, UTF-16, and UTF-32 in use, particularly in the Windows API, but also by Python". What I don't like about this is that the Windows API only has a Unicode API for one encoding (plus legacy code pages). It used to be UCS-2 (in now-discontinued Windows versions; I believe they all are), but it's now UTF-16. And it doesn't have direct indexing to Unicode characters, so what follows isn't too helpful (it's outdated, from the UCS-2 era): "This is due to a belief that direct indexing of code points is more important than 8-bit compatibility". I think we should concentrate first on the main alternative to UTF-8 in use, UTF-16, then possibly explain programming languages. Since there are many, and that text misrepresents Python (it also stores Latin1 internally), maybe just leave it out? Just as text on other encodings such as GB 18030 was moved to another page, possibly we need not mention all UTF-8 alternatives, or what all programming languages do, e.g. Python, as it's not strictly about adoption, rather non-adoption? comp.arch (talk) 12:34, 26 March 2021 (UTC)
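
The point about UTF-16 lacking direct indexing to Unicode characters can be shown with a small Python sketch. The helper name utf16_code_units is hypothetical. It only illustrates that a code point outside the BMP occupies two 16-bit code units, so a code-unit index is not a character index; it says nothing about how any particular Windows API behaves.

```python
def utf16_code_units(text: str) -> list[int]:
    """Return the UTF-16 code units of text (hypothetical helper)."""
    data = text.encode("utf-16-le")
    # Each code unit is two bytes, little-endian here.
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


sample = "a\U0001F600b"  # three code points, one outside the BMP
units = utf16_code_units(sample)
print(len(sample), len(units))
print([f"{u:04X}" for u in units])
```

The sample has three code points but four code units (0061, D83D, DE00, 0062), so the character "b" sits at code-unit index 3, not 2.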

In the work I do, the #1 impediment to using UTF-8 is that Qt uses UTF-16. The #2 impediment is that Python does not use UTF-8; in our code it uses UTF-32, though you are correct that they are trying to improve this to some selection between 8-, 16-, and 32-bit storage of UTF-32 based on the highest code point value, and also by caching a UTF-8 version, as they have finally realized the cost of conversion. It is also quite likely the underlying reason Python and Qt don't use UTF-8 is because of the Windows API using UTF-16, so for me that is the #3 reason (though for Windows programmers it probably is #1). In any case Python and Qt are extremely similar in their guilt in preventing adoption of UTF-8 and should and must be mentioned together. Spitzak (talk) 19:06, 26 March 2021 (UTC)
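
The width selection mentioned here can be observed, roughly, from Python itself. The sketch below assumes CPython 3.3 or later (the flexible string representation of PEP 393); other implementations may report different numbers. The helper name bytes_per_char is hypothetical, and sys.getsizeof measures object size, not a documented encoding.

```python
import sys


def bytes_per_char(ch: str, n: int = 1000) -> int:
    """Estimate per-character storage for strings made only of ch.

    Compares two freshly built strings of the same kind so that the
    fixed object header cancels out (CPython-specific assumption).
    """
    return (sys.getsizeof(ch * (2 * n)) - sys.getsizeof(ch * n)) // n


for ch in ("a", "\u00e9", "\u20ac", "\U0001F600"):
    print(f"U+{ord(ch):04X}: {bytes_per_char(ch)} byte(s) per character")
```

Under that assumption the storage width follows the highest code point in the string: one byte up to U+00FF, two up to U+FFFF, and four beyond that.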
Microsoft first developed its "Multi-Byte Character Set" APIs for Windows NT in the early 1990s, before UTF-8 had achieved much usage, and when Japanese Kanji character sets were more practically important than Unicode. UTF-8, if Microsoft programmers even knew about it at that time, would not have helped them deal with Shift_JIS or whatever... AnonMoos (talk) 22:51, 26 March 2021 (UTC)
Microsoft was pretty far ahead of everybody in figuring out multi-byte character encodings, and thus was in a better position to start using UTF-8. I was working for Sun and they were way behind and convinced that 16-bit characters were necessary, and they were even incapable of handling 8-bit non-ASCII in any intelligent way, often insisting on converting it to 3 octal digits. Microsoft really blew it when they decided to scrap all that work and use UCS-2. Some of this may have been misguided political correctness; there was certainly sentiment that Americans should not get the "better" 1-byte codes. The end result is that ASCII-only software still exists even today! Spitzak (talk) 23:46, 26 March 2021 (UTC)
Microsoft was part of the initial alliance that launched Unicode, of course, but I find it difficult to imagine how it could have done a bunch of UTF-8 software implementation work in the early 1990s, which was then pulled out and replaced by 16-bit wide character interfaces. UTF-8 apparently didn't even exist until September 1992, at a time when Microsoft's MBCS people had to be focused mainly on making Japanese character sets work on the forthcoming Windows NT operating system (there was certainly more money to be made from that than from Unicode in 1992-1993). UTF-8 wasn't even introduced as a formal proposal until 1993, the same year that the first version of Windows NT was released, so the dates don't really seem to align... AnonMoos (talk) 16:14, 29 March 2021 (UTC)
What I meant is that the multi-byte Japanese encodings you are talking about were much more similar to UTF-8 and should have provided a method of transitioning to it, and Microsoft was doing far more to support these transparently than others. Instead Microsoft abandoned all the progress they had made with multibyte encodings to try to use UCS-2, and we are all paying the price even today. Spitzak (talk) 19:03, 29 March 2021 (UTC)
OK -- I'm skeptical as to whether Shift_JIS could prepare the way for UTF-8 in any practical or concretely-useful way, but now I understand what you're saying... AnonMoos (talk) 13:34, 31 March 2021 (UTC)

Microsoft script dead link

   and Microsoft has a script for Windows 10, to enable it by default for its program Microsoft Notepad
   "Script How to set default encoding to UTF-8 for notepad by PowerShell". gallery.technet.microsoft.com. Retrieved 2018-01-30.
   https://gallery.technet.microsoft.com/scriptcenter/How-to-set-default-2d9669ae?ranMID=24542&ranEAID=TnL5HPStwNw&ranSiteID=TnL5HPStwNw-1ayuyj6iLWwQHN_gI6Np_w&tduid=(1f29517b2ebdfe80772bf649d4c144b1)(256380)(2459594)(TnL5HPStwNw-1ayuyj6iLWwQHN_gI6Np_w)()

This link is dead. How to fix it? — Preceding unsigned comment added by Un1Gfn (talkcontribs) 02:58, 5 April 2021 (UTC)

That text, and that link, appear to have been removed, so there's no longer anything to fix. Guy Harris (talk) 23:43, 21 December 2023 (UTC)

utf8 octal conversion

I think this section should be rewritten. It makes no sense to talk about bytes if you have triplets of octal numbers which make 9 bits in total, not 8. The grouping shown in the section is ambiguous (and wrong). --84.167.187.209 (talk) 02:24, 29 May 2021 (UTC)

The table is correct: the results are 1 to 4 bytes, each displayed as 3 octal digits, and the left-most digit cannot be greater than 3. If the bytes were somehow appended into a single octal number, then you would first have an endianness question, and more importantly it would remove the alignment between the output octal digits and the input octal digits.
I have made this modification of the octal table. Do you understand it? x, y, z and w are octal digits. --BIL (talk) 22:11, 29 May 2021 (UTC)
           
Octal code point <-> Octal UTF-8 conversion
First code point  Last code point  Code point  Byte 1  Byte 2  Byte 3  Byte 4
000               177              xxx         xxx
0200              3777             xxyy        3xx     2yy
04000             77777            xyyzz       34x     2yy     2zz
100000            177777           1xyyzz      35x     2yy     2zz
0200000           4177777          xyyzzww     36x     2yy     2zz     2ww
Yes that is a lot clearer. Spitzak (talk) 23:44, 29 May 2021 (UTC)
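
The rows of the octal table can be checked against a standard encoder with a short Python sketch. The helper name utf8_octal is hypothetical. It encodes a code point with Python's built-in UTF-8 codec and prints each resulting byte as three octal digits, which is the per-byte grouping the table uses.

```python
def utf8_octal(code_point: int) -> str:
    """Return the UTF-8 bytes of a code point, each as 3 octal digits."""
    return " ".join(f"{byte:03o}" for byte in chr(code_point).encode("utf-8"))


# First and last code points of the table rows, written in octal.
for cp in (0o177, 0o200, 0o3777, 0o4000, 0o177777, 0o200000, 0o4177777):
    print(f"{cp:o} -> {utf8_octal(cp)}")
```

For example, 0200 comes out as 302 200 and 4177777 as 364 217 277 277, matching the 3xx 2yy and 36x 2yy 2zz 2ww patterns.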

I do agree there is a huge amount of bloat in this article; conversion from/to UTF-8 is actually really simple and I would love to see the majority of this text spew deleted. Spitzak (talk) 20:24, 29 May 2021 (UTC)

Suggest/recommend throwing out the whole UTF-8#Octal section. I’m sure the intellectual exercise must have been “neat” or “kind of cool” to whoever took the time and effort to type it up and add it to the article, but IMHO it’s cruft like this that explains how this article got to be so long and bloated. I haven’t seen this appear in _any_ of the Unicode standards documents, and even the single reference cited admits that the API library just compares the binary, even if it might conceivably, theoretically be more convenient for a human with a scientific calculator converting hexadecimal to octal to compare bits manually. This article would IMHO be much more concise and more “encyclopedic” if the half of it comprising personal commentary/observations such as this section (which might be more appropriate, say, as a post on a personal blog, for example) were trimmed. —PowerPCG5 (talk) 08:35, 10 November 2021 (UTC)
Excellent idea. This section does not add useful information. −Woodstone (talk) 13:52, 10 November 2021 (UTC)
Absolutely agree. About 3/4 of this article is bloated with trivial observations and/or redundant rewording of the same information over and over again. I did edit this table last, not because I liked it, but it was even larger and more intrusive before (they put it in as more columns in the other tables), and attempts to just remove it got reverted... Spitzak (talk) 15:10, 10 November 2021 (UTC)
Such is the unfortunate nature of a community-built wiki - editors contribute to their own niche and hobbies. Criticism of Wikipedia#Systemic bias in coverage
Criticism of Wikipedia#Quality of writing is funny too. Wqwt (talk) 07:21, 4 September 2022 (UTC)

Unfortunately there is an error in it. If the first code point is 0200, then the last code point cannot be 3777, for example. Please consider the description in the main article of how the encoding process works; then you will find that the first Unicode code point is always 0. Apparently that is how it is really done. — Preceding unsigned comment added by SiwardDeGroot (talkcontribs) 14:48, 21 July 2023 (UTC)

You are talking about "overlong encodings". The code point 0 should be done by the one-byte entry in the first line of the table. Encoding code point 0 using the second line of the table is an error. Spitzak (talk) 15:45, 21 July 2023 (UTC)
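
A quick way to see the overlong-encoding rule in practice is to feed such bytes to a strict decoder. The sketch below uses Python's built-in UTF-8 codec; the helper name is_valid_utf8 is hypothetical. The two-byte pattern from the second table row applied to code point 0 gives the bytes C0 80 (300 200 in octal), which a conforming decoder rejects.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(is_valid_utf8(b"\x00"))          # one-byte form of code point 0
print(is_valid_utf8(b"\xc0\x80"))      # overlong two-byte form of code point 0
print(is_valid_utf8(b"\xe0\x80\x80"))  # overlong three-byte form of code point 0
```

Only the one-byte form is accepted, which is why each row of the table starts where the previous row ends rather than at 0.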

US-ASCII

@Comp.arch: With respect to Special:Diff/1105781113, it's better to use just "ASCII" unless it could be misinterpreted as some other variant of ISO 646 instead of ANSI X3.4-1986. It is not the case here, but I think current usage in the article is okay. IANA preference for "US-ASCII" only matters for use in the charset parameter or similar where "ASCII" is not even a valid label at all. Please don't link it halfway as US-ASCII because that makes absolutely no sense in any context and looks like a formatting mistake. Link the whole US-ASCII, piping it if you don't like the redirect. – MwGamera (talk) 12:49, 22 August 2022 (UTC)