Talk:UTF-8/Archive 4


"UTF-8 should be the default choice"

"It was also suggested that UTF-8 should be the default choice of encoding for all Unicode-compliant software."

http://www.theregister.co.uk/2013/10/04/verity_stob_unicode/ is a secondary source, published in a well-known technology magazine, that comments on the UTF-8 Everywhere publication, and there is no indication that it is affiliated with the authors of the latter. Please explain how it does not count as a valid source for the above claim. Thanks. 82.80.119.226 (talk) 17:00, 15 June 2015 (UTC)

It's a valid source for the above claim. Who cares about the claim? Lots of things have been suggested; for the most part, the suggestions are not interesting enough to note in Wikipedia.--Prosfilaes (talk) 17:30, 15 June 2015 (UTC)
Notability is not about being a "suggestion", a person or an event; it is about importance. The manifesto clearly gained enough pagerank and endorsements from many respected groups. It is the subject of an ongoing revolution and must definitely be referenced by Wikipedia to make Wikipedia better. In fact, I believe it even deserves an entry. I would say link the manifesto, but not TheRegister. 46.117.0.120 (talk) 15:36, 16 June 2015 (UTC)
What manifesto? "It was also suggested that UTF-8 should be the default choice of encoding for all Unicode-compliant software." says not a thing about any manifesto.--Prosfilaes (talk) 23:36, 16 June 2015 (UTC)
IMHO, the TheRegister article clearly refers to The Manifesto at http://www.utf8everywhere.org/. Exactly this masterpiece, in my opinion, should be referenced in the article. The only question is where and how. — Preceding unsigned comment added by 46.117.0.120 (talk) 00:47, 17 June 2015 (UTC)
The problem of referencing http://utf8everywhere.org/ directly is that it is a primary source which is not eligible to be referenced directly on its own. 82.80.119.226 (talk) 10:26, 17 June 2015 (UTC)
So add it not as a reference but as an external link. I am sure you guys understand more about the proper way to link it from Wikipedia than I do. There has to be a solution. If none is suggested, let's revert to the original proposed "it was suggested" formulation. — Preceding unsigned comment added by 46.117.0.120 (talk) 18:02, 17 June 2015 (UTC)
Many people have attempted to have Wikipedia reference http://www.utf8everywhere.org/ or the proposed Boost extensions to make it possible to open files on Windows using UTF-8, but they always get reverted; there appears to be a contingent out there who do not want this opinion to ever be visible and who come down fast and hard against anybody who attempts it. The fact that such an opinion exists and is shared by a large and rapidly growing portion of the programming community is actually pretty important when discussing Unicode encodings and should be mentioned in Wikipedia articles, but obviously somebody does not want that and is using the excuse that it is "opinion" to delete it every time. That said, I would prefer a link to utf8everywhere; the Reg article is pretty awful. Spitzak (talk) 18:42, 15 June 2015 (UTC)
I don't have a strong opinion about UTF8Everywhere. I have a strong opinion about "It was also suggested that ..." without saying who suggested it or why we should care that they suggested it.--Prosfilaes (talk) 23:36, 16 June 2015 (UTC)
Does "It is recommended by parts of the programming community that ..." sound better? This is a factual information, not opinion based, and is definitely relevant to the subject. Your argument of "why we should care that..." is faulty because it can be applied to any other piece of information virtually everywhere on Wikipedia. Why should we care about IMC recommendations? W3C recommendations? How community recommendations have less weight in this respect? 82.80.119.226 (talk) 10:26, 17 June 2015 (UTC)
"It is recommended by parts of the programming community that ..." Unix is being unreasonable for using more than 7 bits per character. (The Unix-Haters Handbook). That everyone should use UTF-16. That everyone should use ISO-2022. W3C recommendations are community recommendations, and from a known part of the community. I don't know why I should take UTF8Everywhere any more seriously then DEC64.--Prosfilaes (talk) 15:56, 17 June 2015 (UTC)
The Unix-Haters Handbook, UTF-16, ISO-2022, DEC64 -- if you have an indication that these are notable enough opinions (have secondary sources, etc...) then you are more than welcome to include these opinions in the respective articles. 82.80.119.226 (talk) 16:47, 17 June 2015 (UTC)
No, Wikipedia is NPOV, not positive POV. Even if we often do, we shouldn't have cites to pro-UTF-8 pages on UTF-8 and pro-UTF-16 page on UTF-16. To the extent that public support for UTF-8 over UTF-16 or vice versa is documented on either page, it should be consistent between pages.--Prosfilaes (talk) 19:26, 21 June 2015 (UTC)
I agree that it's even more important to link The Manifesto from the UTF-16 page. But, we are now discussing the UTF-8 page. 31.168.83.233 (talk) 14:21, 22 June 2015 (UTC)
I agree about 'suggested' and we can agree that there's no need for a holy war and no doubt about manifesto's notability. So, Prosfilaes, what formulation would you suggest? And where to link it from? I would leave the decision to you. Also whether the manifesto is good or bad or right or wrong is for sure irrelevant. I personally like it very much. 46.117.0.120 (talk) 14:02, 17 June 2015 (UTC)

The "manifesto" is not a bad piece, but it's one group's opinion and not a very compelling case for "UTF-8 everywhere". Each of the three principal encodings has strengths and weaknesses; one should feel free to use the one that best meets the purpose at hand. Since all three are supported by well-vetted open source libraries, one can safely choose at will. The manifesto's main point is that encoding UTF-16 is tricky; but it's no more tricky than UTF-8, and I've seen home-rolled code making mistakes with either. This really is a tempest in a tea-cup. -- Elphion (talk) 01:53, 17 June 2015 (UTC)

Whether it's compelling or not is not a criterion for the inclusion of the opinion. And as for the manifesto's main point, my interpretation is that using one encoding everywhere simplifies everything, and it then gives a bunch of reasons why this encoding must be UTF-8. It also argues that UTF-16 is a historical mistake. 82.80.119.226 (talk) 10:26, 17 June 2015 (UTC)
What is the criterion for the inclusion of the opinion?--Prosfilaes (talk) 15:56, 17 June 2015 (UTC)
Notability. Just like anything else on Wikipedia. If you think that the one reference provided isn't enough, then say that, I'll find more. 82.80.119.226 (talk) 16:47, 17 June 2015 (UTC)
The danger of "free to choose" approach is addressed in FAQ#4 of the manifesto itself. This is also the reason for 'by default'. — Preceding unsigned comment added by 46.117.0.120 (talk) 00:47, 17 June 2015 (UTC)
You are repeating exactly what this manifesto argues against. The whole point of the manifesto is that the encodings do NOT have "equal strengths and weaknesses". It states that the encodings are NOT equal, and that any "weakness" of UTF-8 is shared by all the other encodings (i.e. variable-sized "characters" exist in all 3, even UTF-32), thus weakness(UTF-8) is less than weakness(others). Now you can claim the article is wrong, but that is an OPINION. You can't take your opinion and pretend it is a "fact" and then use it to suppress a document with a different opinion. The article is notable and is being referenced a lot. The only thing that would make sense is to link both it and a second article that has some argument about either a strength of non-UTF-8 that does not exist in UTF-8 or a weakness of UTF-8 that does not exist in non-UTF-8. Spitzak (talk) 17:18, 17 June 2015 (UTC)
I would like more references in this article.
I agree with Elphion that "You should always use UTF-8 everywhere" is not a neutral point of view. However, my understanding of the WP:YESPOV policy is that the proper way to deal with such opinions is to include all verifiable points of view which have sufficient weight.
I hear that something "is being referenced a lot". Are any of those places that reference it usable as a WP:SOURCE? If so, I think we should use those places as references in this article.
Are there places that say that maybe UTF-8 shouldn't always be used everywhere? Are any of those places usable as a WP:SOURCE? If so, I think that opposing opinion(s) should also be mentioned, using those places as references, in this article. --DavidCary (talk) 02:52, 18 June 2015 (UTC)
I have found only one. This: CONSIDERING UTF-16 BE HARMFUL BE CONSIDERED HARMFUL — Preceding unsigned comment added by 46.117.0.120 (talk) 04:38, 18 June 2015 (UTC)
unborked link: http://www.siao2.com/2012/04/27/10298345.aspx (SHOULD CONSIDERING UTF-16 BE HARMFUL BE CONSIDERED HARMFUL) -- Elphion (talk) 16:30, 18 June 2015 (UTC))
According to WP:SOURCE the place I referenced is usable as a source. My formulation is fine according to WP:YESPOV, and if Prosfilaes, or anyone else, thinks that this makes the section biased in any way, they should add a mention of a WP:SOURCE-conforming contradicting opinion, not revert it. As such, can some experienced Wikipedian do that so I would not engage in edit wars with Prosfilaes? Obviously changing the formulation is also an option, but the only alternatives I can currently think of are:
  • It was also suggested that...
  • It is recommended by parts of the programming community that...
  • The UTF-8 Everywhere manifesto published by Pavel Radzivilovsky et al. and supported by parts of the programming community, recommends that...
82.80.119.226 (talk) 10:48, 21 June 2015 (UTC)
I oppose this. There's simply no evidence this manifesto is notable. Actual use, declarations of intent by notable organizations, maybe something with a massive swell of public support, but not a manifesto by two people with a Register article mention.--Prosfilaes (talk) 19:26, 21 June 2015 (UTC)
I believe this http://www.reddit.com/r/programming/comments/sy5j0/the_utf8everywhere_manifesto/ and this https://news.ycombinator.com/item?id=3906253 are enough evidence for notability. There's more. — Preceding unsigned comment added by 31.168.83.233 (talk) 14:11, 22 June 2015 (UTC)
Neither of those sites offer any evidence for notability for anything.--Prosfilaes (talk) 22:55, 22 June 2015 (UTC)
I believe it does. The huge public debate IS notability. — Preceding unsigned comment added by 46.117.0.120 (talk) 15:40, 23 June 2015 (UTC)
You are just making up rules, and I doubt you have the right to do this. According to WP:SOURCE it is notable enough. If you think I got something wrong, please cite the specific text.
Actual use? People have already added these comments to the article:
UTF-8 is also increasingly being used as the default character encoding in operating systems, programming languages, APIs, and software applications.
Possible Citations, I'm sure you could find lots more, but these are good samples showing that the low-level api is in UTF-8: [1] [2] [3] (look for set_title)
There are more: Boost.Locale, GDAL, SQLite (its narrow char interface is UTF-8 even on Windows, and the wide char isn't supported by the VFS), Go, Rust, D, Erlang, Ruby, and the list is going on...
82.80.119.226 (talk) 15:39, 22 June 2015 (UTC)
WP:WEASEL says that we don't say that "It is recommended by parts of the programming community that..." I don't specifically have cites for why we don't say "Don Smith says that UTF-8 is unnecessary and breaks half the programs on his system", but we don't, because everyone has an opinion.
The rest of that massively misses the point. I'm not arguing about whether UTF-8 is good. Yes, it's frequently used (though that first sentence is problematic; given the rise of Java-driven Androids and .NET and Windows, is it really increasing? What does iOS use?), and yes, we should list examples of major cases where it's being used. The question under hand is non-notable opinions about it.--Prosfilaes (talk) 22:55, 22 June 2015 (UTC)

The manifesto's FAQ #4 is a good example of what the authors get wrong. The question asks whether any encoding is allowable internally, and the authors respond (reasonably) that they have nothing against this. But then they argue that std::string is used for all kinds of different encodings, and we should just all agree on UTF-8. First, this doesn't answer the question, which in context is talking about UTF-8 vs UTF-16: nobody would use std::string (a sequence of bytes) for UTF-16 -- you would use std::wstring instead. Second, std::string has a precise meaning: it is a sequence of bytes, period. It knows nothing of Unicode, or of the UTF-8 encoding, or how to advance a character, or how to recognize illegal encodings, etc., etc. If you are processing a UTF-8 string, you should use a class that is UTF-8 aware -- one specifically designed for UTF-8 that does know about such things. Passing your UTF-8 string to the outside as std::string on the assumption that the interface will treat it as Unicode is just asking for trouble, and asking the world to adopt that assumption (despite the definition of std::string) is naive and will never come to pass. It will only foster religious wars. If you need to pass a UTF-8 string as std::string to the outside (to the OS or another library), test and document the interface between your UTF-8 string and the outside world for clarity. The manifesto's approach only muddies the waters.
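For concreteness, a minimal sketch (my own illustration, not code from the manifesto or the article) of the distinction being made: std::string hands you bytes and nothing more, and anything that understands UTF-8 lead bytes has to be layered on top of it.

    #include <algorithm>
    #include <cstddef>
    #include <string>

    // Return the index just past the code point starting at position i.
    // Precondition: i < s.size(). std::string itself knows nothing about
    // this; the knowledge of UTF-8 lead-byte patterns lives entirely in
    // this caller-side helper.
    std::size_t next_code_point(const std::string& s, std::size_t i) {
        unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t len = (lead < 0x80) ? 1   // 0xxxxxxx: ASCII
                        : (lead < 0xC0) ? 1   // stray continuation byte: step over it
                        : (lead < 0xE0) ? 2   // 110xxxxx
                        : (lead < 0xF0) ? 3   // 1110xxxx
                        :                 4;  // 11110xxx
        return std::min(i + len, s.size());
    }

Whether such a helper lives in a dedicated UTF-8 string class or in free functions over std::string is exactly the design question the two sides above disagree on.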

You are utterly misunderstanding the manifesto, spouting out exactly what it is arguing against as though they are "facts", when in fact the whole paper is about how your "facts" are wrong. YES they are saying "use std::string ALWAYS" to store UTF-8 and non-UTF-8 and "maybe UTF-8" and also random streams of bytes. No, you DO NOT need a "UTF-8 aware container", contrary to your claim. Yes, you CAN "pass your UTF-8 string to the outside as a std::string" and it WILL work! Holy crap, the "hello world" program does that! Come on, please at least try to understand what is going on and what the argument is. I don't blame you, something about Unicode turns otherwise good programmers into absolute morons, so you are not alone. This manifesto is the best attempt by some people who actually "get it" to try to convince the boneheaded people who just cannot see how vastly simple this should be.Spitzak (talk) 20:41, 18 June 2015 (UTC)
Naturally, I disagree with your analysis of my understanding of the manifesto. But if you wish to continue this discussion, I suggest we move it to my talk page, as it contributes little to the question at hand. -- Elphion (talk) 03:07, 19 June 2015 (UTC)
Not every piece of code dealing with strings is actually involved in processing and validation of text. A file copy program which receives Unicode file names and passes them to file IO routines, would do just fine with a simple byte buffer. If you design a library that accepts strings, the simple, standard and lightweight std::string would do just fine. On the contrary, it would be a mistake to reinvent a new string class and force everyone through your peculiar interface. Of course, if one needs more than just passing strings around, he should then use appropriate text processing tools.31.168.83.233 (talk) 15:03, 22 June 2015 (UTC)
Yes, that's what I meant above by "processing" a string. If you're just passing something along, there's no need to use specialized classes. But if you are actually tinkering with the UTF internals of a string, rolling your own code to do that on the fly is a likely source of errors and incompatibility. -- Elphion (talk) 20:00, 22 June 2015 (UTC)
In such case I suggest you remove the "FAQ #4" comment and the following argument. It is no longer relevant and the authors are doing okay. — Preceding unsigned comment added by 46.117.0.120 (talk) 22:51, 23 June 2015 (UTC)

The argument that UTF-8 is the "least weak" of all the encodings is silly. The differences in "weakness" are minuscule and mostly in the eyes of the beholder. As a professional programmer dealing with Unicode, you should know all of these and be prepared to deal with them. The important question instead is which encoding best suits your purposes as a programmer for the task at hand. As long as your end result is a valid Unicode string (in whatever encoding) and you communicate this to whatever interface you're sending it to, nobody should have cause for complaint. The interface may need to alter the string to conform to the expectations on the other side (Windows, e.g.). It is precisely the interfaces where expectations on both sides need to be spelled out. Leaving them to convention is the wrong approach.

I would say that all three major encodings being international standards endorsed by multiple actors is sufficient warrant for using them.

-- Elphion (talk) 18:24, 18 June 2015 (UTC)

Let's remove the sample code

I don't think it belongs in the article. sverdrup (talk) 12:11, 29 June 2015 (UTC)

Agreed. Mr. Swordfish (talk) 20:35, 9 July 2015 (UTC)
One response is not a "discussion". That code is what 50% of the visitors to this page are looking for. Instead I would prefer it if the many, many paragraphs that describe in excruciating detail over and over and over again how UTF-8 is encoded were deleted, reduced to that obvious 7-line table that is at the start, and this code was included. Also I note nobody seems to want to delete the code in the UTF-16 article. Spitzak (talk) 18:02, 13 July 2015 (UTC)
I didn't think it needed much discussion since it's a clear violation of Wikipedia policy. Wikipedia is not a how-to guide. Sample code is inappropriate. Unless you can provide some sort of reasoning why we should make an exception here, it should go.
I seriously doubt that anywhere near 50% of visitors are here to view C code. As for UTF-16, I've never had the occasion to look at that page, but if it includes sample code it should be removed too. Other stuff exists is not necessarily a valid argument. Mr. Swordfish (talk) 20:58, 13 July 2015 (UTC)
Code can be a very useful tool for explaining computer science matters, so I don't see it a clear violation of policy. Heapsort is a good example. I'm not sure C is the best language for code-based explanation, and I'm not sure that that code was of use to most of the visitors of the page.--Prosfilaes (talk) 23:26, 14 July 2015 (UTC)
The Heapsort article uses Pseudocode, which is an excellent way of providing an informal high-level description of the operating principle of a computer program or other algorithm. Since pseudocode is intended to be read by humans it is appropriate for an encyclopedic article. Actual code in a specific language - not so much, especially in this case since we are not describing an algorithm or program. Mr. Swordfish (talk) 14:19, 15 July 2015 (UTC)
Actual code in a specific language is intended to be read by humans, as well. Clear code in an actual language can be clearer than low-level pseudocode, since the meaning is set in stone. In this case... we don't seem to be doing a good job of communicating how UTF-8 works for some people. I don't particularly believe the C code was helping, but I'm not sure deleting it helped make the subject more clear.--Prosfilaes (talk) 22:18, 15 July 2015 (UTC)
Fair enough. I now see that there are a lot of wikipedia pages with example code (https://en.wikipedia.org/wiki/Category:Articles_with_example_code) and I'm not about to go on a crusade to remove them all. So I guess the question is whether that code example improves the article or not. I don't think it does, and don't really understand the point of including it in the article. Mr. Swordfish (talk) 15:49, 16 July 2015 (UTC)
If anything you say is true, then the excruciatingly detailed tables that repeat over and over and over again where the bits go (when the first 7-line table is by far the clearest) should be removed as well. They are simply attempts to describe the same program in English. Spitzak (talk) 00:39, 15 July 2015 (UTC)
I believe the code example helps A LOT and should be restored. The remaining text is nearly useless (except for the 7-line table that some people keep trying to obfuscate). The code actually describes how it works and makes it clear what patterns are allowed and which ones are disallowed; it also shows what to do with invalid byte sequences (turn each byte into a code). Spitzak (talk) 17:32, 27 July 2015 (UTC)
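For reference, since the removed code is not quoted anywhere in this thread, here is a rough sketch (mine, not the deleted sample) of the approach described: decode one sequence at a time, and on any invalid or truncated sequence consume a single byte and return that byte value as the code.

    #include <cstddef>

    // Decode one code point from [p, end). Precondition: p < end.
    // On an invalid or truncated sequence, consume exactly one byte and
    // return its value as the code, i.e. "turn each byte into a code".
    unsigned decode_utf8(const unsigned char* p, const unsigned char* end,
                         std::size_t* consumed) {
        unsigned c = p[0];
        std::size_t len = (c < 0xC2) ? 1          // ASCII, continuation bytes, and C0/C1
                        : (c < 0xE0) ? 2
                        : (c < 0xF0) ? 3
                        : (c < 0xF5) ? 4 : 1;     // F5..FF can never start a sequence
        if (len > static_cast<std::size_t>(end - p)) len = 1;   // truncated at end of buffer
        unsigned value = (len == 1) ? c : (c & (0x7Fu >> len));
        for (std::size_t i = 1; i < len; ++i) {
            if ((p[i] & 0xC0) != 0x80) { *consumed = 1; return p[0]; }  // not 10xxxxxx
            value = (value << 6) | (p[i] & 0x3Fu);
        }
        // Reject overlong forms, surrogates, and values above U+10FFFF.
        static const unsigned min_for_len[5] = {0, 0, 0x80, 0x800, 0x10000};
        if (len > 1 && (value < min_for_len[len] ||
                        (value >= 0xD800 && value <= 0xDFFF) || value > 0x10FFFF)) {
            *consumed = 1;
            return p[0];
        }
        *consumed = len;
        return value;
    }

Returning the raw byte value for errors is just one convention; the point is that the caller always makes progress and never loses bytes.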
I suppose we'll just have to disagree on its usefulness - I do not think it adds to the article in any meaningful way. Perhaps other editors can weigh in.
If we are going to restore it, we need to ask where it came from and how we know it is correct. I don't see any sources cited - either it was copied from somewhere ( with possible copyright implications) or an editor wrote it him or herself (original research). So, first we need to reach consensus that it improves the article, and then we'll need to attribute it to a reliable source. Mr. Swordfish (talk) 12:59, 29 July 2015 (UTC)
I wrote it. Some of it is based on earlier code I wrote for fltk. The code is in the public domain.Spitzak (talk) 17:52, 29 July 2015 (UTC)

Comment from lead

Move the following material to Talk (originally in the lead of the article and since August 2014 commented out)

UTF-8 is also increasingly being used as the default character encoding in operating systems, programming languages, APIs, and software applications.{{cn}}

Possible Citations, I'm sure you could find lots more, but these are good samples showing that the low-level api is in UTF-8:

https://developer.apple.com/library/mac/qa/qa1173/_index.html

https://developer.gnome.org/pango/stable/pango-Layout-Objects.html#pango-layout-set-text

http://wayland.freedesktop.org/docs/html/protocol-spec-interface-wl_shell_surface.html (look for set_title)

End moved material -- Elphion (talk) 12:35, 10 September 2015 (UTC)

Misleading caption

Graph indicates that UTF-8 (light blue) exceeded other main encodings of text on the Web, that by 2010 it was nearing 50% prevalent, and up to 85% by August 2015.[2]

While the 85% statement may be true, the graph doesn't indicate any such thing. NotYourFathersOldsmobile (talk) 22:13, 10 September 2015 (UTC)

Yes, attempts to update the graph to a newer version produced by the same engineer at Google have been reverted because he has not made clear what the copyright is on the new graph.Spitzak (talk) 23:26, 10 September 2015 (UTC)

Huge number of incorrect edits!

112.134.187.248 made a whole lot of misinformed edits. With any luck he will look here. Sorry, but it was almost impossible to fix without a complete revert, though he was correct exactly once, when he used "will" instead of "could"; I will try to keep that.

1. An Indic code point cannot take 6 bytes. He is confusing things with CESU-8, I think, though even that will encode these as 3 bytes as they are still in the BMP.

2. You cannot pass ASCII to a function expecting UTF-16. Declaring the arguments to be byte arrays does not help, and I doubt any sane programmer would do that either. Therefore it is true that UTF-16 requires new apis.

3. Many attempts to claim there is some real-world chance of Asian script being larger in UTF-8 than UTF-16, despite obvious measurements of real documents that show this does not happen. Sorry you are wrong. Find a REAL document on-line that is larger. Stripping all the markup and newlines does not count.

4. Deletion of the fact that invalid byte sequences can be stored in UTF-8, by somehow pretending that they will magically not happen. Sorry, try actually programming before you believe such utter rubbish. Especially odd because some other edits indicate that he thinks invalid sequences will happen and are somehow a disadvantage of UTF-8.

5. Belief that markup tags using non-ASCII can remove the fact that it makes UTF-8 smaller. Sorry but markup contains far more slashes and angle brackets and spaces and quotes and lots and lots of ASCII-only tags so this will just not happen, no matter how much you wish it could.

6. Claim that invalid sequences of 4 bytes are somehow a problem, while ignoring invalid sequences in UTF-16 and all other invalid sequences in UTF-8. This is despite earlier edits where he basically pretends invalid sequences magically don't happen. Sorry you can't have it both ways.

7. Complete misunderstanding of why multibyte sequences in UTF-16 cause more problems than in UTF-8: because they are RARE. Believe me, NOBODY screws up UTF-8 by "confusing it with ASCII" because they will locate their mistake the very first time a non-ASCII character is used. That is the point of this advantage.

Spitzak (talk) 17:34, 19 October 2015 (UTC)

Includes response to Spitzak (talk)'s UTF-8 propaganda.
  • 1: It was not an "Indic codepoint" but an Indic character. Characters are not codepoints; see TSCII. The letter "பு" takes 6 bytes in UTF-8 but only one in TSCII. Nice straw man. "Sorry, try actually programming before you believe such utter rubbish.": Same goes for you; a character is not necessarily a codepoint. Try reading the Unicode TRs and standards before you even attempt to program using encodings.
  • 2: Well, UTF-8 requires new APIs too, to pass NUL as an argument.
  • 3: Why don't you cite a reliable source? Anecdotes or WP:NOR. And SGML can still use non-ASCII for markup; there is no restriction on this.
  • 4: Try to learn some English first; "Thinks". I programmed, too.
  • 5: "Will just not happen" is normative, not positive. It is your opinion, and Wikipedia is not Facebook.
  • 6: I can understand this. Sorry, I apologize.
  • 7: Normative (opinion). If there is an error, I say that we spot it even with Latin in UTF-16, if there are too many NULs or too few. It depends on the programmer.
Still, UTF-16 can recover using surrogate points because there is a particular order of encoding surrogates, and they are paired; if a surrogate is missing, the other is just discarded, and if both are there in the right places, it can be used as the starting point for recovery.
Invalid filenames: the UTF-8 encoding says that anything over 4 bytes as a code point is illegal, and therefore it should be discarded; bugs are not to be considered features.
112.134.196.135 (talk) 06:32, 20 October 2015 (UTC)
"Deletion of the fact that invalid byte sequences can be stored in UTF-8": Invalid sequences can happen ANYWHERE; it is not an advantage for UTF-8.
Interspersed comments do not work at Wikipedia (consider what the page would look like if anyone wanted to reply to one of your points). Therefore I have extracted your responses and posted them above. I have not looked at the issue yet, but the tone of the above responses is what would be expected from a belligerent rather than someone wanting to help build the encyclopedia. Johnuniq (talk) 06:51, 20 October 2015 (UTC)
Thanks for editing, Johnuniq. The edits by Spitzak include a lot of normative content; haven't you noticed his tone? I am trying to build a good encyclopedia, with less opinion-based information and good neutrality.
1: The number of bytes that a character takes in any Unicode format is unbounded, though I think Unicode permits some limit on the number of combining characters to be stacked on one base character. That TSCII encodes as one codepoint what Unicode does as two is not really relevant to this page.
2: If you want to include NUL in a string, you need an API that will let you do it. If you don't care, you can use NUL-terminated strings. There's absolutely no difference between ASCII and UTF-8 in this regard, and many people use NUL terminated UTF-8 strings.
3: Cites are important where there's a real dispute, but this is citable, I believe, and in practice it is true. I don't see why SGML, which is a generic form of markup instead of markup itself, is relevant here; of course a markup language in theory could use non-ASCII characters, but I've never seen it done, and HTML, TeX and other common forms of ASCII markup are probably a million times more common than any non-ASCII markup.
5: Bah. It's measurable. You can demand cites, but don't demean other posts by dismissing facts as their opinion.--Prosfilaes (talk) 15:38, 20 October 2015 (UTC)
1. It is still true that TSCII is a multi-byte (or single-byte, depending on how you view it) encoding, and still deals with this problem. Even though the 2CP problem is due to Unicode itself, it still applies because it is a multi-byte encoding that encodes the same characters, just more efficiently for a certain script.
2: Yes, they do use null-terminated strings, I can understand that these are used in argv[]. But still, null-terminated strings are what house a lot of other bugs; UTF-8 does not expose them first, but UTF-16 forces them to care about NUL problems.
3: Yes, SGML places no restrictions on this, HTML does; this is just what popular document-tools do. However, when the Chinese content is larger than the markup tags, it is space-advantageous for UTF-16. I am not saying that this does not happen, but what the author says 'often' is something that needs a citation on 'how often'; the citation provided was anecdotal. For example, for a non-fancy document, or some document that properly separates documents, scripts and stylesheets. 'Sometimes' is a better word.
5. WP:NOR still applies.
To the point 7: People who treat UTF-16 as UCS-2 are similar to people who treat UTF-8 as ASCII. Which editors happen to be the majority? Editors who treat UTF-8 as ASCII, I can say from anecdotes. Right now, hardly any editors have these problems; blaming UTF-16 for UCS-2 is like blaming UTF-8 for Latin; the blame-game is based on obsolete encoding assumptions. Who says "NOBODY"? Weasel?
How exactly is saving the illegal characters an advantage over UTF-16? If it is a stream, both UTF-8 and UTF-16 can carry them. Both these standards say to discard invalid codepoints; you may claim that anything beyond 4 bytes after transformation can still be ignored, but any other invalid surrogates can be ignored in UTF-16 as well. I am the same person, again.112.134.200.102 (talk) 18:39, 20 October 2015 (UTC)
1: The 2CP problem? Sounds like OR to me. Sure as heck doesn't come up in a web search.
2: NUL terminated strings are just fine for many problems; they do not "house a lot of bugs".
3: So your OR wins over a citation? SGML is irrelevant here; there are many forms of markup, and none of the popular ones use characters outside of ASCII.--Prosfilaes (talk) 02:05, 21 October 2015 (UTC)
1. 2CP problem: 2 code-point per character problem. It is not an OR; it is a problem mentioned above; it is very relevant in 'specific multi-byte encodings' or 'specific single-byte encoding'. It was already mentioned here, but someone said it's irrelevant; it is still an encoding, well-capable of encoding that " two code points" in single byte.
3: So UTF-16 is fine for me if I use non-NUL-terminated strings. What's the problem here? Null-terminated strings still house a lot of bugs.[1][2] If programmers' ignorance is not to be cited, then remove the thing about UCS-2, and all implementation-related bugs/faults as well. Edit: Sorry, someone else removed it, or I removed it; it is no longer there; this point is done. The point was that programmers' ignorance in mistaking UTF-16 for UCS-2 is not a weakness of UTF-16 per se, any more than UTF-8 getting mistaken for some Latin-xxxx is a weakness of UTF-8, especially when we have specific BOMs.
3: My claim was not OR; the author's claim was the OR. It is up to the claimant to prove something. I am not saying that most popular markups do not use ASCII; it is up to the author to prove (or at least cite from a reliable source) that the size of the markup is larger than the size of the other-language text, for him to use the word "often" here. "Sometimes" is a better word. Moreover, I can still say that usually a web page is generated on request; they do not store HTML files there, just text that will be inserted there, in a database, with very little markup. It is not about whether it is my OR or not; it is just not acceptable to give an anecdote as a citation to show how often this happens, even if you think SGML is irrelevant.112.134.234.218 (talk) 05:27, 21 October 2015 (UTC)
I removed the one about UTF-8 encoding invalid strings; a UTF-16 processor that can still process invalid sequences can likewise ignore them. It is quite the same as UTF-16; you either have this advantage both ways or neither way. Moreover, Unix filenames/paths are encoding-unaware; they still treat the file paths/names as ASCII; only a NUL is a problem there, whereas having a NUL is not a problem with length-prefixed filenames or other length-aware methods, or methods like double NUL. If what he means by invalid filename is an invalid character sequence, any encoding-unaware stream processor can do it (either one that does not discard anything after 4 bytes or after the 2nd ordered surrogate), subject to internal limitations like NUL or NULNUL or other ones caused by pre-encoding the stream length.
Removed the sentence blaming UTF-16 for inconsistent behaviour with invalid strings, while not explaining the consistency of invalid sequences in invalid UTF-8 streams (the over-4-byte problem). If there is a standard behaviour that makes UTF-8 more consistent with invalid sequences than UTF-16, please mention the standard.112.134.188.165 (talk) 16:11, 21 October 2015 (UTC)
Unix filenames do not treat the name like ASCII; they treat it as an arbitrary byte string that does not include 00 or 2F. Perfect for storing UTF-8 (which was designed for it), impossible for storing UTF-16. Your links to complaints about NUL-terminated strings are unconvincing and irrelevant here.--Prosfilaes (talk) 08:17, 25 October 2015 (UTC)
Prosfilaes: Yes, UNIX just treats them as arbitrary NUL-terminated strings without 00/2F, while Windows handles it differently. These are based on internal limitations. My point is that the 'possibility to store invalid strings' exists in both of them; that's not a UNIX-only bug/feature. He responded as if encoding a broken UTF-8 string were a UTF-8-only feature, while you can have broken streams pretty much in any place where (1) it does not violate internal limitations and (2) it is encoding-unaware. I do not have any problem with NUL-terminated files per se, but I don't see why a broken bytestream should be considered a feature in one and not a feature in the other.175.157.233.124 (talk) 15:43, 25 October 2015 (UTC)
Cleaned up the article due to WP:WEASEL ("... suggested that...").
Cleaned up the article, again WP:WEASEL ("...it is considered...") by whom? Weasels?175.157.125.211 (talk) 07:48, 22 October 2015 (UTC)
112.134.190.56 (talk) 21:12, 21 October 2015 (UTC)
Cleaned up: "If any partial character is removed the corruption is always recognizable." Recognizable how? UTF-8 is not an integrity-check mechanism; if one byte is missing or altered, it will be processed AS-IS. However, the next byte is processable, and it is already listed there.175.157.94.167 (talk) 09:18, 22 October 2015 (UTC)
The point you mentioned, '4. Deletion of the fact that invalid byte sequences can be stored in UTF-8, by somehow pretending that they will magically not happen. Sorry, try actually programming before you believe such utter rubbish. Especially odd because some other edits indicate that he thinks invalid sequences will happen and are somehow a disadvantage of UTF-8.'
I did not delete it. Please be respectful to others; do not accuse others falsely. Invalid sequences can be stored in anything, especially in encoding-unaware or buggy implementations. I can see where this is coming from; I am currently in Sri Lanka and am not them, who are probably from somewhere else.
Spitzak: See your edit here: https://en.wikipedia.org/w/index.php?title=UTF-16&type=revision&diff=686526776&oldid=686499664 . Javascript is NOT Java, and the content you removed says nothing implying failure to support UTF-16; it is just that the distinction was made so that surrogate pairs are counted separately from code points. In fact, it's I who added the word 'Javascript' first to that article.
I removed the implication that counting code units somehow makes it "not support UTF-16". Whether you meant it or not, the wording sounded like "Javascript tries but counts surrogate pairs as 2 and this is wrong". Also, a paragraph at the end points out that *ALL* the languages using UTF-16 work this way, so Javascript is not special.Spitzak (talk) 06:13, 2 November 2015 (UTC)
Please refrain from using phrases like "any sane programmer", "NOBODY screws" which are examples of No true Scotsman fallacy.175.157.94.167 (talk) 10:07, 22 October 2015 (UTC)
For edits like 1, please see WP:AGF.
Prosfilaes: I clarified your point in the article; if you think it is too much for an encyclopedia, please feel free to remove it. Thanks :) 175.157.233.124 (talk) 15:54, 25 October 2015 (UTC)
Prosfilaes: The article (Indic-specific) TSCII exists on Wikipedia; that's why I stated it; it works as I mentioned (it is in the table in the article), and it is capable of encoding in one byte the two codepoints for one letter. The standard is here: https://www.iana.org/assignments/charset-reg/TSCII . I am sorry if it is considered OR or rare.

UTF-8 does not require the arguments to be nul-terminated. They can be passed with a length in code units. And UTF-16 can be nul-terminated as well (with a 16-bit zero character), or it can also be passed with a length in code units. Please stop confusing nul termination. The reason an 8-bit api can be used for both ASCII and UTF-8 is because sizeof(*pointer) is the same for both arguments. For many languages, including C, you cannot pass an ASCII array and a UTF-16 array as the same argument, because sizeof(*pointer) is 1 for ASCII and 2 for UTF-16. This is why all structures and apis have to be duplicated.
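To illustrate the sizeof(*pointer) point with a throwaway example (the function names here are made up, loosely echoing the set_title API mentioned earlier):

    #include <cstdio>

    void set_title_8(const char* s)      { std::printf("title: %s\n", s); }
    void set_title_16(const char16_t* s) { (void)s; /* would need a UTF-16-aware sink */ }

    int main() {
        set_title_8("hello");            // plain ASCII: fine
        set_title_8("h\xC3\xA9llo");     // the same char* API carries UTF-8 ("héllo") unchanged
        set_title_16(u"h\u00E9llo");     // UTF-16 needs its own entry point;
        // set_title_16("hello");        // a char* argument would not even compile here,
                                         // because the code unit size (sizeof(*pointer)) differs.
    }

That is the sense in which one 8-bit API serves both ASCII and UTF-8, while UTF-16 forces a parallel set of signatures.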

Everybody seems to be completely unable to comprehend how vital the ability to handle invalid sequences is. The problem is that many systems (the most obvious are Unix and Windows filenames) do not prevent invalid sequences. This means that unless your software can handle invalid sequences and pass them to the underlying system, you cannot access some possible files! For example you cannot make a program that will rename files with invalid names to have valid names. Far more important is the ability to defer error detection to much later when it can be handled cleanly. Therefore you need a method to take a fully arbitrary sequence of code units and store it in your internal encoding. I hope it is obvious that the easiest way to do this with invalid UTF-8 is to keep it as an array of code units. What is less obvious is that using standard encodings you can translate invalid UTF-16 to "invalid" UTF-8 by using the obvious 3-byte encoding of any erroneous surrogate half. This means that UTF-8 can hold both invalid Unix and invalid Windows filenames. The opposite is not true unless you diverge in serious ways from how the vast majority of UTF-8/UTF-16 translators work by defining some really odd sequences as valid/invalid.
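A sketch of the "obvious 3-byte encoding" mentioned above, for whoever wants to see the bytes (my illustration; strict UTF-8 forbids this, which is exactly the point):

    #include <cstdint>

    // Write the generalized 3-byte UTF-8 form of a 16-bit code unit u
    // (0x0800..0xFFFF), including lone surrogates 0xD800..0xDFFF. This is
    // how an unpaired half from broken UTF-16 (e.g. a bad Windows filename)
    // can be carried in a byte string and converted back losslessly.
    void put3(std::uint16_t u, unsigned char out[3]) {
        out[0] = static_cast<unsigned char>(0xE0 | (u >> 12));
        out[1] = static_cast<unsigned char>(0x80 | ((u >> 6) & 0x3F));
        out[2] = static_cast<unsigned char>(0x80 | (u & 0x3F));
    }
    // Example: the lone high surrogate 0xD800 comes out as ED A0 80.

(This is essentially the scheme that the WTF-8 section further down the page gives a name to.)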

The idea that SGML can be shorter if you just use enough tags that contain Chinese letters is ridiculous. I guess if the *only* markup was <tag/> (i.e. no arguments and as short as possible, with only 3 ASCII characters per markup), and if every single tag used contained only Chinese letters, and they averaged to greater than 3 characters per tag, then the markup would be shorter in UTF-16. I hope you can see how totally stupid any such suggestion is.

I and others mentioned an easy way for you to refute the size argument: point to an actual document on-line that is shorter in UTF-16 versus UTF-8. Should be easy, go do it. Claiming "OR" for others' inability to show any examples is really low. Spitzak (talk) 06:13, 2 November 2015 (UTC)

Spitzak:

For god's sake, I never said that UTF-8 requires NUL-termination; I just said that it is possible to pass a NUL-less UTF-8 string around with traditional NUL-terminated char arrays in legacy code. In that case, any ASCII-encodable encoding can do NUL termination. Please keep in mind that you are editing the English Wikipedia. Single-byte NUL terminators are a UNIX-specific internal limitation, just as other systems have their own.
"I hope you can see how totally stupid any such suggestion is.": Whatever it is, regardless of that, the problem is that it is an un-cited non-neutral claim, which is against the policies of a general Wikipedia article[3]. Don't claim "most" or the frequency if you cannot cite it anyway, from a large number of samples. The current citation is currently good enough only as an Existential quantifier.
"Claiming "OR" for others inability to show any examples is really low.": Wikipedia is not Facebook or any other general social media to shout opinions or anecdotal evidence without verifiability. The burden of proof lies on the claimant, not someone else to disprove; so, it needed citation anyway from a large study. I am not AGAINST this, but it needed citation. burden to demonstrate verifiability lies with the editor who adds or restores material, WP:VERIFIABILITY says.
As a side note: if phrases like "is ridiculous" are acceptable here and to you, see how many of your own edits have been ridiculous because you misunderstood the English there; I just pointed them out nicely. Remember, it does not take some extreme exceptional skill to become not nice. "I removed the implication that counting code units somehow makes it "not support UTF-16". Whether you meant it or not, the wording sounded like "Javascript tries but counts surrogate pairs as 2 and this is wrong". Also, a paragraph at the end points out that *ALL* the languages using UTF-16 work this way, so JavaScript is not special.": I don't see how it implied that JavaScript does not support UTF-16. Supporting UTF-16 means supporting UTF-16, regardless of whether it is the wrong way or the right way. UTF-16 is an encoding standard. You could have chosen to remove just the part that offended (while it did not, anyway). If you are unsure and had a problem with repetition on that page, there is a Wikipedia way of doing that using templates, too; it looks like this: [needs copy edit]. There is an entire essay regarding this, worth reading: WP:TEARDOWN. The word 'JavaScript' wasn't even there until I added it, so there is no purpose in arguing whether it is special or not. JavaScript is not Java and deserves a separate mention in that section. "An indic code point cannot take 6 bytes. He is confusing things with CESU-8 I think...": It was not a codepoint, it was a character. And you should see how TSCII works, to know how those 6 bytes are simplified into one. Now, I did not call them ridiculous. 112.135.97.96 (talk) 14:47, 5 November 2015 (UTC)
Surprisingly, WP:WEASEL there, too: "... leads some people to claim that UTF-16 is not supported". Who? Can't anyone just remove them instead of accusing me? 112.135.97.96 (talk) 15:10, 5 November 2015 (UTC)
"Everybody seems to be completely unable to comprehend how vital the ...": This is what we mean by (not-fully)-encoding-unaware. Filenames work at a very low level; it is just that Windows uses wchar and UNIX uses just bytes. This is just limited by internal limitations of both - a UNIX filename cannot contain a 00. Why do you keep saying that 'the obvious way'? UTF-8 is not any more obvious than UTF-16. "standard encodings you can translate invalid UTF-16 to "invalid" UTF-8 by ": It isn't standard if it is invalid. I kind of get your point (like encoding higher codepoints), but this is too much OR and needs sources. Remember that standard UTF-32 can encode even more codepoints than the 6-byte UTF-8 can do (32 bits vs 31 bits), twice as much. Good luck encoding 0xFFFFFFFE in a six-byte UTF8 stream. 112.134.149.92 (talk) 06:39, 6 November 2015 (UTC)
Spitzak: https://en.wikipedia.org/w/index.php?title=UTF-16&type=revision&diff=688641878&oldid=688640790
This simplifies searches a great deal: This is just not encyclopedic. Simplifies search according to whom? Wikipedia strives to be a high-quality encyclopedia with reliable references.112.135.97.96 (talk) 15:10, 5 November 2015 (UTC)
Spitzak: I am kind of satisfied with your last edit on this page, though, and the current status of it. Thanks.112.135.97.96 (talk) 15:53, 5 November 2015 (UTC)

175.157.213.232 (talk) 11:56, 6 November 2015 (UTC) 175.157.213.232 (talk) 12:00, 6 November 2015 (UTC) 175.157.213.232 (talk) 12:48, 6 November 2015 (UTC)

References

Modified UTF-8

In this section it is stated:

"Modified UTF-8 uses the 2-byte overlong encoding of U+0000 (the NUL character), 11000000 10000000 (hex C0 80), rather than 00000000 (hex 00). This allows the byte 00 to be used as a string terminator."

It does not explain why a single byte 00000000 cannot be used as a string terminator. Given that all other bytes beginning with 0 are treated as ASCII, why should 00 be any different?

FreeFlow99 (talk) 11:53, 8 September 2015 (UTC)

If you use a single null byte (00) to indicate the end of a string, then whenever you encounter 00 you have to decide whether it indicates an actual character (U+0000) in the string or the end of the string itself. The point of the modified scheme is that 00 would never appear in character data, even if the data included U+0000. This has two advantages: (1) 00 does not code characters, so it is freed up for local use as the traditional metacharacter to indicate unambiguously the end of a string, and (2) 00 never appears in encoded data, so that software that chokes on 00 or ignores 00 can still handle data that includes U+0000. The downside, of course, is that you have to avoid software that chokes on the overlong sequence. -- Elphion (talk) 13:34, 8 September 2015 (UTC)
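A small sketch of the idea (ignoring the separate question of how Modified UTF-8 handles supplementary characters): escape U+0000 as C0 80 so the byte 00 never occurs in the data and can still terminate the string.

    #include <cstdio>
    #include <string>

    // Replace every 00 byte (an encoded U+0000) with the overlong pair C0 80.
    std::string to_modified_utf8(const std::string& utf8) {
        std::string out;
        for (unsigned char b : utf8) {
            if (b == 0x00) { out += '\xC0'; out += '\x80'; }
            else           { out += static_cast<char>(b); }
        }
        return out;
    }

    int main() {
        std::string data("a\0b", 3);                   // text that really contains U+0000
        std::string m = to_modified_utf8(data);        // bytes: 61 C0 80 62
        std::printf("%zu data bytes, strlen sees %zu\n",
                    m.size(), std::char_traits<char>::length(m.c_str()));
        // Prints "4 data bytes, strlen sees 4": the embedded NUL survives as
        // data, yet 00 remains free to act as an unambiguous terminator.
    }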
Thanks for your reply. However I'm not sure I completely understand the merits. If we are talking about a text only file, 00x should never be in a string (paper tape being obsolete), [unless it is used to pad a fixed length field, in which case I understand the point]. What you say would be a problem for data files because bytes aren't just used to store characters but can store values such as integers, reals etc, and these can conflict with control characters; but this problem is not limited to 00x, it applies to all control codes. The only reason I can see to treat NUL separately is that 00x is very common. To avoid ambiguity completely we would need to limit control code usages to text only files, and for data files use a structure eg where each component has an id and a size followed by the actual data, or use some different system with a special bit that is out of reach of any possible data values (conceptually a 'ninth' bit) that flags the 'byte' as a control character. Or have I missed something? FreeFlow99 (talk) 14:54, 8 September 2015 (UTC)
Your statement "if we are talking about a text only file, 00x should never be in a string" is the misunderstanding. U+0000 is a perfectly valid Unicode character. Unicode does not define its meaning beyond its correspondence with ASCII NUL, and applications use it in a variety of ways. It is in fact common in real data. You should therefore expect character data to contain it, and be prepared to handle it (and preserve it as data). Modified UTF-8 is one way of doing that without giving up the notion of 00 as string terminator. The alternative pretty much requires that string data be accompanied by its length (in bytes or characters) -- which is not a bad idea. -- Elphion (talk) 16:33, 8 September 2015 (UTC)
Why should UTF-8 encoded data use U+0000 as valid data? UTF-8 is meant to encode Unicode text, and there U+0000 is an ASCII NUL, not a character. If you want to encode binary data in e.g. XML files, it would be better to use base64, not UTF-8, which has a worse size factor between input and output, and should also not be used to encode U+DC00, U+FFFE etc.--BIL (talk) 22:09, 5 November 2015 (UTC)
The simple answer: show me a Unicode Consortium reference saying that U+0000 is not a character. It's a valid codepoint; you don't get to dictate how others use it. It is widely used in database fields and as a separator, in what is otherwise text data, not binary data. -- Elphion (talk) 23:27, 5 November 2015 (UTC)
A question: is overlong NUL supported by other encodings such as UTF-16? I think that long UTF-16 is possible only for codepoints above U+FFFF. If not then UTF-8 and UTF-16 are not compatible with each other.--BIL (talk) 10:06, 20 November 2015 (UTC)
I think what you're asking is whether there are two distinct byte sequences (whether valid or not) that could be interpreted as U+0000 using the UTF-16 encoding rule, and the answer (as you suggest) is no. This does not make UTF-8 and UTF-16 incompatible. In all three encodings (UTF-8, Modified UTF-8, UTF-16) there is only one valid way to encode U+0000: respectively 00, C080, 0000. None of these encodings supports an alternative encoding of U+0000, i.e., a different valid sequence that decodes to U+0000. -- Elphion (talk) 13:49, 20 November 2015 (UTC)

Leaving an interesting link here

Generating tables similar to the ones in UTF-8#Examples -- suzukaze (tc) 06:21, 25 February 2016 (UTC)

cef vs ces confusion

arguing against 'undid revision 715814058'...'not an improvement':


unicode character encoding form: utf-8, utf-16, utf-32.

unicode simple character encoding scheme: utf-8, utf-16be, utf-16le, utf-32be, utf-32le.

unicode compound character encoding scheme: utf-16, utf-32.


"avoid the complications of endianness and byte order marks in the alternative utf-16 and utf-32 encodings"

character encoding forms don't have endianness and byte order marks, so this is about character encoding schemes.

mentioning one simple character encoding scheme and ignoring all others is misleading; it's trying to compare the three character encoding forms in character encoding scheme context.

this confusion is because of name ambiguity (e.g. utf-32 cef corresponds to utf-32 ces, utf-32be ces and utf-32le ces; utf-8 cef only corresponds to utf-8 ces (big endian)).


revision 715814058 mentions all simple character encoding schemes in places discussing character encoding scheme information (endianness, byte order marks, backward compatibility). 177.157.17.126 (talk) 23:18, 29 April 2016 (UTC)

The terms "simple ces" and "compound ces" just introduce more confusion, and obscure the points being made. UTF-8 attempts to avoid issues of BOM and endianness; any form of UTF-16 or UTF-32 should communicate the byte order, either via BOM or by specifying the endianness in the communication protocol. That's really all that needs to be said. Elphion (talk) 02:19, 30 April 2016 (UTC)
ok, but utf-8 specifies the endianness in the communication protocol to the same extent as any other simple ces.
one simple ces (utf-8) versus all other ces (utf-16be, utf-16le, utf-32be, utf-32le, utf-16, utf-32) is a misleading grouping, sounds like utf-8 is the only simple ces.
what about removing the term "compound character encoding schemes" from revision 715814058 but retaining the grouping simple ces vs compound ces?
like "It was designed for backward compatibility with ASCII and (like UTF-16BE, UTF-32BE, UTF-16LE and UTF-32LE) avoids the complication of byte order marks in UTF-16 and UTF-32." 177.157.17.126 (talk) 01:37, 1 May 2016 (UTC)
No, UTF-8 has no endianess, as it is simply a byte stream. By contrast, with any of the UTF-16 or UTF-32 protocols, you have to determine the endianess (whether through the protocol specification, BOM, or direct inspection) to combine the bytes into numbers appropriately. Users don't care whether the UTF-xx they're getting is simple or complex, they just want to know the endianess of the code units. I suspect your concern is to suggest that specifying the endianess in the communication protocol should be preferred because it removes ambiguity; but it would be better to say that directly rather than to rely on the technical terms, which are not well known or understood. (And in any event, the right place to make that suggestion would be in UTF-16 and UTF-32, not here.) This is all further complicated by the fact that in the real world you often find unnecessary BOMs in simple ces's and missing ones in complex ces's. So you can't count on the presence or absence of BOMs; the issue really isn't BOMs at all, it's endianess, which is what really matters. A robust program will "trust but verify" the endianess specified in the protocol anyway, since mistakes do happen (with some frequency). -- Elphion (talk) 02:22, 1 May 2016 (UTC)
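A concrete example of the byte-assembly step being discussed, using U+20AC (€) in three encodings (a sketch for illustration):

    #include <cstdio>

    int main() {
        const unsigned char utf8[]    = {0xE2, 0x82, 0xAC};   // U+20AC in UTF-8
        const unsigned char utf16be[] = {0x20, 0xAC};         // U+20AC, big-endian code unit
        const unsigned char utf16le[] = {0xAC, 0x20};         // U+20AC, little-endian code unit

        // The 16-bit code unit only exists once a byte order is chosen and applied:
        unsigned be = (utf16be[0] << 8) | utf16be[1];
        unsigned le = (utf16le[1] << 8) | utf16le[0];
        std::printf("BE: U+%04X  LE: U+%04X\n", be, le);       // both print U+20AC

        // The UTF-8 bytes are consumed one at a time in stream order; there is
        // no multi-byte numeric code unit to assemble, only bit fields to combine:
        unsigned cp = ((utf8[0] & 0x0Fu) << 12) | ((utf8[1] & 0x3Fu) << 6) | (utf8[2] & 0x3Fu);
        std::printf("UTF-8: U+%04X\n", cp);                    // U+20AC
    }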
my concern is that the wording misleads people into thinking "UTF-8 has no endianess"; any multi-byte encoding has endianess, you still need to choose between sending the most significant byte first or the least significant byte first in a byte stream.
utf-8 is big-endian by protocol specification, just like utf-16be and utf-32be.
a robust program could verify that utf-8 wasn't sent with the least significant byte first too; probably nobody does that, so it isn't an issue, but it would be better to say that directly rather than rely on misleading wording. (this is not about design anyway)
in utf-8 each code unit is a byte, so each code unit doesn't have endianess; this information is useless for most purposes; users don't care the endianess of isolated code units, they care about the endianess of the byte stream. 177.157.17.126 (talk) 07:31, 1 May 2016 (UTC)
I have to disagree with that. Users do care about endianness of what you called ‘isolated code units,’ because this is the only place the concept of endianness makes any sense at all. If you insist on saying that UTF-8 is big-endian because bytes carrying information about higher bits of codepoint value come first in stream before those determining the lower ones, then, by the same logic, UTF-16LE is also big-endian because high surrogates come before low ones (or more accurately PDP-endian, but definitely not little!). Thinking that ‘UTF-8 has no endianness’ is perfectly reasonable; no byte-swapped variant of it ever existed. — mwgamera (talk) 09:07, 1 May 2016 (UTC)
in utf-16le:
bytes in code units are little-endian.
code units in code points are big-endian.
bytes in code points are middle-endian (little-big-endian to be unambiguous).
in utf-8:
bytes in code units are endian neutral.
code units in code points are big-endian.
bytes in code points are big-endian (neutral-big-endian).
in utf-32be:
bytes in code units are big-endian.
code units in code points are endian neutral.
bytes in code points are big-endian (big-neutral-endian).


yes, you need to know utf-16le is little-big-endian (and variable-width) to implement it.
note utf nomenclature omit some details, so it isn't utf-8nbe, utf-16lbe, utf-16bbe, utf-32lne and utf-32bne.
'no byte-swapped variant of utf-8 ever existed' is historical, not about design; it would cause the same endianness issues of other encodings if existed. 179.181.84.241 (talk) 02:25, 3 May 2016 (UTC)


Yes. Characterizing UTF-8 as big-endian is a misuse of the term. Endianess refers specifically to the process of grouping bytes to form larger (16-bit, 32-bit, 64-bit) numerical values, as in the UTF-16 and -32 coding units. (By extension, it can also refer to aggregating, say, 16-bit words into 32-bit words.) It does not refer to extracting data from bytes via a predefined protocol, as in UTF-8. Once you start addressing the internal structure of the bytes (or words), you're no longer dealing with endianess. Similarly, UTF-16-LE is little-endian because the bytes forming the numerical values of the 16-bit code units are presented least-significant first. The fact that the high surrogate comes first is immaterial, because you have to deal with the bit structure of the surrogates to assemble them into a numerical codepoint value -- that's not an "endian" process, but governed by a separate protocol. -- Elphion (talk) 18:51, 1 May 2016 (UTC)
i see, utf-8 "avoid the complications of endianness" by using a separate protocol that is more complicated than endianness.
i consider endianness an abstract concept that can be applied to all ordered data; e.g. date formats.
using your narrower definition of endianness we need another term, "stripped order": the data order stripping (ignoring) variable-width (or other) metadata.
utf-32le is little-endian, utf-8 is big-stripped-order, utf-16le is little-endian-big-stripped-order.
stripped order has all issues of endianness and more for metadata handling; so "UTF-8 has no endianess" is correct, but misleading (unless you add stripped order information). 179.181.84.241 (talk) 02:25, 3 May 2016 (UTC)

" 'avoid the complications of endianness' by using a separate protocol that is more complicated than endianness."
Yes, that is exactly right. Errors due to mismatches in numerical byte order between software and hardware were so common that a term (endianness) was coined to describe the phenomenon, and over time people learned to pay attention to the problem. One approach is to communicate information in the transfer protocol. Another is to add structure to the data so that the information becomes self-documenting, and therefore less reliant on the accuracy of the transfer description. The latter is the approach of UTF-8. There is a trade-off: the computational cost of the extra structure, versus the added resilience against error. This is a fundamental tension in all digital processes (including things like DNA).

You're free to generalize endianness to describe things it was never meant to address, but statements like "utf-32le is little-endian, utf-8 is big-stripped-order, utf-16le is little-endian-big-stripped-order", and similar suggestions farther up, are (in my opinion) confusing rather than enlightening.

-- Elphion (talk) 14:27, 3 May 2016 (UTC)

Indeed, Danny Cohen's famous Internet Engineering Note 137, "On Holy Wars and a Plea for Peace", turned 36 years old last month. His attempt to stop the big- vs. little-endian war of the ARPANET and early Internet was somewhat less successful than Gulliver's attempt to stop the Lilliput vs. Blefuscu war. RossPatterson (talk) 01:52, 6 May 2016 (UTC)

where's the source referred to in #WTF-8?

Section WTF-8 says "The source code samples above work this way, for instance." – I can't find these samples, where are they? --Unhammer (talk) 06:25, 24 September 2015 (UTC)

It's gone as discussed above, so I removed the sentence. Someone should probably add a note about WTF-8 being jocularly used to mean a kind of encoding error before Rust people hijacked it. — mwgamera (talk) 23:09, 27 October 2015 (UTC)
Added. I knew of the encoding error meaning (=mojibake), but was unaware of this newer meaning until today. Dsalt (talk) 15:26, 21 June 2016 (UTC)

Question about 'codepage layout' table

The legend states: Red cells must never appear in a valid UTF-8 sequence. The first two (C0 and C1) could only be used for an invalid "overlong encoding" of ASCII characters (i.e., trying to encode a 7-bit ASCII value between 0 and 127 using two bytes instead of one; see below). The remaining red cells indicate start bytes of sequences that could only encode numbers larger than the 0x10FFFF limit of Unicode. Why are cells F8-FB (for invalid 5-byte sequences) in a separate, darker red colour? They don't seem conceptually distinct from F5-F7 or FC-FF. 74.12.94.137 (talk) 11:03, 12 May 2016 (UTC)

I think it is trying to show alternating light/dark for the different code lengths.Spitzak (talk) 19:30, 12 May 2016 (UTC)
It does appear to be that. (However, black text on dark backgrounds is not a good idea.) Dsalt (talk) 19:39, 21 June 2016 (UTC)

Does Notepad ever produce a UTF-8 file?

The article contains the sentence "Many Windows programs (including Windows Notepad) add the bytes 0xEF, 0xBB, 0xBF at the start of any document saved as UTF-8." Yet when I dumped a UTF-8 file which had been edited by someone on a Windows machine I found that it was actually UTF-16. Every other byte was null. There is no purpose in putting a BOM on a UTF-8 file and although many Windows users talk about "UTF-8 with BOM", is it not the case that all the files they are talking about are actually encoded in UTF-16? I don't have a Windows machine past Vista but looking up the Windows API function IsTextUnicode() I found the remarkable comment with the test IS_TEXT_UNICODE_ODD_LENGTH: "The number of characters in the string is odd. A string of odd length cannot (by definition) be Unicode text." So for the Windows API (today), Unicode and UTF-16 are synonymous. Is it therefore correct to change the sentence above to reflect this? Chris55 (talk) 14:42, 30 June 2016 (UTC)

Huh, answered my own question. On Notepad under Vista (but updated more recently), when I did a save on a UTF-8 file, it proffered UTF-8 as the default to save under. So either the person who sent me the file had a very old copy or he didn't know the difference. The BOM is a nuisance and not recommended by anyone, but all the options they offered, including the never-standardized ANSI, appeared to work OK. Unicode is just the name they use for UTF-16. Chris55 (talk) 16:42, 30 June 2016 (UTC)
I'm pretty certain there is a method to save as UTF-8. Note that lots of Windows documentation uses "Unicode" when what it really should say is "UTF-16" (or maybe "UCS-2"). To Windows programmers a UTF-8 file is *not* "Unicode" since it is not "UTF-16". This is an endless source of annoying confusion.Spitzak (talk) 16:44, 30 June 2016 (UTC)
Another wrinkle: not sure off-hand which programs/Win-version do this, but some will save an edited file with BOM in UTF-8 if the original source had BOM (in whatever encoding). Annoying confusion indeed. -- Elphion (talk) 19:59, 30 June 2016 (UTC)
It is true that the Windows terminology is based on old circumstances and is sometimes not really correct. Notepad has four encodings to choose between when saving a file: "ANSI", "Unicode", "UTF-16BE" and "UTF-8". Here "ANSI" is Windows-1252, which has never been an ANSI standard, and "Unicode" is UTF-16LE, even though the Unicode standard prefers UTF-16BE over UTF-16LE. All of these except ANSI place a BOM first.--BIL (talk) 20:47, 30 June 2016 (UTC)
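
For what it's worth, detecting or skipping that three-byte signature is trivial; a minimal C sketch (the helper name is illustrative):

    #include <string.h>

    /* Return 3 if the buffer starts with the UTF-8 signature EF BB BF
       written by Notepad and friends, else 0, so a caller can skip it. */
    static size_t skip_utf8_bom(const unsigned char *buf, size_t len) {
        static const unsigned char bom[3] = { 0xEF, 0xBB, 0xBF };
        return (len >= 3 && memcmp(buf, bom, 3) == 0) ? 3 : 0;
    }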

No mention of what UTF stands for.

In the article, there was no mention that UTF stands for Unicode Transformation Format[1]. This could either go at the start of the article in parenthesis after the bolded "UTF-8" or here.

--Quasi Quantum x (talk) 19:24, 2 August 2016 (UTC)

There was already a mention in the lead, which I've slightly adjusted. Nitpicking polish (talk) 20:00, 2 August 2016 (UTC)

References

  1. ^ "FAQ - UTF-8, UTF-16, UTF-32 & BOM". unicode.org. Unicode, Inc. Retrieved 2 August 2016.

Page is misleadingly organized

A page about UTF-8 should start by giving the definition of UTF-8, not by exploring the whole history of how it developed. History should come later. The current organization will cause many people to misunderstand UTF-8.

-- Andrew Myers (talk) 23:05, 21 February 2017 (UTC)

I agree that the description should be before the history, and that the history can be placed after Derivatives.--BIL (talk) 20:04, 22 February 2017 (UTC)
Unfortunately that way the tables are out of chronological order. — RFST (talk) 06:42, 15 March 2017 (UTC)

External links modified

Hello fellow Wikipedians,

I have just modified 2 external links on UTF-8. Please take a moment to review my edit. If you have any questions, or need the bot to ignore the links, or the page altogether, please visit this simple FaQ for additional information. I made the following changes:

When you have finished reviewing my changes, you may follow the instructions on the template below to fix any issues with the URLs.

This message was posted before February 2018. After February 2018, "External links modified" talk page sections are no longer generated or monitored by InternetArchiveBot. No special action is required regarding these talk page notices, other than regular verification using the archive tool instructions below. Editors have permission to delete these "External links modified" talk page sections if they want to de-clutter talk pages, but see the RfC before doing mass systematic removals. This message is updated dynamically through the template {{source check}} (last update: 18 January 2022).

  • If you have discovered URLs which were erroneously considered dead by the bot, you can report them with this tool.
  • If you found an error with any archives or the URLs themselves, you can fix them with this tool.

Cheers.—InternetArchiveBot (Report bug) 18:56, 5 April 2017 (UTC)

email addresses

Why do so many email addresses have "UTF-8" in them?

For example: "=?utf-8?Q?FAIR?= <fair@fair.org>"

Dagme (talk) 06:09, 21 April 2017 (UTC)

=?utf-8?Q?FAIR?= is not the email address itself, but the display name attached to it, written in MIME's "encoded-word" syntax (RFC 2047). UTF-8 enables names using letters outside the English or Western European alphabets to be used. If the name is shown as =?utf-8?Q?FAIR?=, then the receiving email application can't decode the name format. There is always a period when some software does not support the standards on the net. The actual email address is fair@fair.org, and as far as I know UTF-8 or anything outside ASCII is not permitted there (according to the article email address, efforts are being made to allow such addresses, but they will be incompatible unless every router and receiver supports such addresses). Your email will at least reach its destination.--BIL (talk) 18:37, 21 April 2017 (UTC)
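
For illustration, a minimal C sketch of how such an encoded-word can be produced from a UTF-8 display name (the function name is made up, and a real MIME implementation must also honour the 75-character limit per encoded word):

    #include <stdio.h>

    /* RFC 2047 Q-encoding, simplified: spaces become '_', and any byte that
       is not plain printable ASCII (or that is '=', '?' or '_') becomes =XX. */
    static void print_encoded_word(const unsigned char *utf8_name) {
        printf("=?utf-8?Q?");
        for (const unsigned char *p = utf8_name; *p; p++) {
            if (*p == ' ')
                putchar('_');
            else if (*p < 0x21 || *p > 0x7E || *p == '=' || *p == '?' || *p == '_')
                printf("=%02X", *p);
            else
                putchar(*p);
        }
        printf("?=");
    }

    /* print_encoded_word((const unsigned char *)"M\xC3\xBCller") prints
       =?utf-8?Q?M=C3=BCller?= -- the UTF-8 bytes of the name, Q-encoded. */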

Backward compatibility

I am trying to fix text that seems to be confused about how the backward compatibility with ASCII works. I thought it was useful to mention extended ASCII, but that seems to have produced much more confusion, including a whole introduction of the fact that UTF-8 is very easy to detect and that invalid sequences can be substituted with other encodings, something I have been trying to point out over and over and over, but which really is not a major feature of UTF-8 and does not belong here.

The underlying statement is that a program that looks at text, copying it to an output, and that only specially treats some ASCII characters, will "work" with UTF-8. This is because it will see some odd sequences of bytes with the high bit set, and maybe "think" that is a bunch of letters in ISO-8859-1 or whatever, but since it does not treat any of those letters specially, it will copy them unchanged to the output, thus preserving the UTF-8 encoded characters. The simplest example is printf, which looks for a '%' in the format string and copies every other byte to the output without change (it will also copy the bytes unchanged from a "%s" string, so that string can contain UTF-8). Printf works perfectly with UTF-8 because of this and does not need any awareness of encoding added to it. Another obvious example is slashes in filenames.
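
A minimal C sketch of the kind of pass-through loop described above (a toy, not printf's actual implementation):

    #include <stdio.h>

    /* Like printf's core loop, this treats only the ASCII byte '%' specially
       and copies every other byte verbatim.  No byte of a multi-byte UTF-8
       sequence can equal '%', so such characters in fmt or in the substituted
       string pass through intact. */
    static void tiny_printf_s(const char *fmt, const char *s) {
        for (const char *p = fmt; *p; p++) {
            if (p[0] == '%' && p[1] == 's') {
                fputs(s, stdout);
                p++;                      /* skip the 's' */
            } else {
                fputc((unsigned char)*p, stdout);
            }
        }
    }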

My initial change was to revert some text that implied that a program that "ignores" bytes with the high bit set would work. The exact meaning of "ignore" is hard to say but I was worried that this implied that the bytes were thrown away, which is completely useless as all non-ASCII text would vanish without a trace.

Code that actually thinks the text is ISO-8859-1 and does something special with those values will screw up with UTF-8. For instance, a function that capitalizes all the text and thinks it knows how to capitalize ISO-8859-1 will completely destroy the UTF-8. However, there are a LOT of functions that don't do this; for instance, a lot of "capitalization" only works for the ASCII letters and thus works with UTF-8 (where "work" means "it did not destroy the data").
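
A sketch of that ASCII-only case in C (illustrative):

    /* ASCII-only upper-casing: bytes outside 'a'..'z', including every byte
       of a multi-byte UTF-8 sequence, are left untouched, so the encoding
       survives even though non-ASCII letters simply aren't capitalized. */
    static void ascii_toupper_str(char *s) {
        for (; *s; s++)
            if (*s >= 'a' && *s <= 'z')
                *s -= 'a' - 'A';
    }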

Would appreciate it if anybody can figure out a shorter wording that everybody agrees on and clearly says this. The current thing is really bloated and misleading. Spitzak (talk) 18:59, 6 June 2017 (UTC)

First, your notion that you are some kind of Unicode god and that everybody else is a mere mortal and confused is a little aggravating to say the least. But thanks for at least posting something on the Talk page before reverting willy-nilly like you usually do.
The original text that you reverted did not state that code which ignored 8-bit characters would work. It said that an ASCII processor which ignored 8-bit characters would see all and only the ASCII characters in a UTF-8 byte stream, in the correct order. This is true. The ASCII content in a UTF-8 stream can be separated from the non-ASCII content without any decoding. The bytes with the high bit set are a separate non-ASCII stream, so to speak. There aren't any 7-bit characters hidden inside the multi-byte 8-bit sequences because overlong sequences are not allowed. Whether it is acceptable to separate the 7-bit stream from the 8-bit is of course dependent on the situation. When you ignore some of the input, you are, you know, ignoring some of the input. Passing the 8-bit characters through uninterpreted, which you think is "the point", may also work depending on the purpose of the processing and whether there is some further processing that knows what to do. It won't be useful in all cases, such as if the processing is the end of the line, for the purpose of textual display or page rendering for example. That is where fallback/replacement comes in. Anyway, when I unreverted the original, I removed this whole point not because it is false, but because it is apparently confusing, and there are better and more succinct ways to get at the point. As for fallback and autodetection, you commented it out because it is "important" but not "salient". Come on. I'm changing the intro sentence to read "important" features rather than "salient", so you'll be happy. Now we have a list of important features, including fallback and autodetection, rather than "salient" features. We have one thing in the list which is merely "important" but doesn't rise to the level of "salient". But I think we'll live. Sheesh. Person54 (talk) 19:24, 6 June 2017 (UTC)
I also added your point about printf, but I am not sure this is actually correct. One of the features of printf format strings is a field width indication, e.g. "%8s", meaning a string field which occupies 8 character-width units on the display or page. It isn't clear that printf would substitute UTF-8 strings correctly into such width-specified fields without decoding the UTF-8 so as to know how many characters they represent, and how many spaces are needed for padding. So the printf example may not be a good example of processing which can treat the UTF-8 input as if it were ASCII. Person54 (talk) 20:00, 6 June 2017 (UTC)
New one looks good, though still pretty lengthy. I do believe the fact that there are many invalid sequences which thus allow UTF-8 to be mixed with legacy encodings without the need for a BOM or other indicator is important, so I guess it is good that it is here, though I really doubt that was considered when UTF-8 was originally designed.
"characters" is pretty ill-defined in Unicode and can have many different answers, so number of code units is certainly a better definition of the string width as it has a definite meaning, and also because it is the value of offsets returned by searching UTF-8. I suppose printf could be made "UTF-8 aware" by truncating the printed string at the start of the code point that exceeds this limit but I am not very certain this is an improvement.Spitzak (talk) 16:14, 7 June 2017 (UTC)
A better example of a way to make printf UTF-8 aware is that snprintf truncates at a given number of code units. It may make sense for it to truncate at the start of a code point. Since it returns the number of code units printed this is less of a problem than having %ns do unexpected actions.Spitzak (talk) 16:22, 7 June 2017 (UTC)
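
A sketch of that truncation idea in C, assuming valid UTF-8 input (the helper name is made up):

    #include <stddef.h>

    /* Given a desired cut point in bytes (no larger than the string length),
       back up past any continuation bytes (10xxxxxx) so the cut lands on a
       code point boundary. */
    static size_t utf8_trim_to_boundary(const unsigned char *s, size_t limit) {
        while (limit > 0 && (s[limit] & 0xC0) == 0x80)
            limit--;
        return limit;
    }
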
I fudged the printf example so that what is in the article is correct, if a little vague; but it still might not be the best example of the point. I agree it is sort of long-winded. Other folks have been swooping in and making my contributions more succinct, and these are improvements, which I was happy to see.
As for what was originally intended, UTF-8 was designed literally on the back of a napkin in a diner by Ken Thompson and then coded up by him in a couple of days. So it wouldn't be surprising if he didn't think of everything, and UTF-8 turned out even better than he imagined. On the other hand, we're talking about Ken Thompson, so maybe we should not underestimate what he foresaw. Person54 (talk) 16:32, 7 June 2017 (UTC)

We kind of discussed this in 2008; see Archive 1 and User_talk:JustOnlyJohn... AnonMoos (talk) 13:29, 7 June 2017 (UTC)

External links modified (January 2018)

Hello fellow Wikipedians,

I have just modified one external link on UTF-8. Please take a moment to review my edit. If you have any questions, or need the bot to ignore the links, or the page altogether, please visit this simple FaQ for additional information. I made the following changes:

When you have finished reviewing my changes, you may follow the instructions on the template below to fix any issues with the URLs.

This message was posted before February 2018. After February 2018, "External links modified" talk page sections are no longer generated or monitored by InternetArchiveBot. No special action is required regarding these talk page notices, other than regular verification using the archive tool instructions below. Editors have permission to delete these "External links modified" talk page sections if they want to de-clutter talk pages, but see the RfC before doing mass systematic removals. This message is updated dynamically through the template {{source check}} (last update: 18 January 2022).

  • If you have discovered URLs which were erroneously considered dead by the bot, you can report them with this tool.
  • If you found an error with any archives or the URLs themselves, you can fix them with this tool.

Cheers.—InternetArchiveBot (Report bug) 01:57, 23 January 2018 (UTC)

Count of code points

I made an edit yesterday, in which I added to the table at the top of the section "Description" a column with the number of code points in each "number of bytes". But the information I entered there was wrong because it didn't take into account invalid code points, and fortunately the edit was quickly reverted by Spitzak. He said that I shouldn't do this because "it will lead to an endless argument about whether invalid code points count". But I wonder if we could do something a bit different that would avoid that problem and still provide helpful information.

I see that the article starts out with the statement that there are 1,112,064 valid code points, which consist of the 1,114,112 total of my proposed column, less 2,048 invalid ones. The validity rules are fairly complex and I haven't seen them summarized by saying which of those 2,048 invalid code points have two, three, or four bytes. Is there a reference with that information? If so, we could add to the table in the article the three right-hand columns of the following table, or keep this as a separate table if adding the new columns would make the table in the article too wide:

Number of bytes   Total code points   Invalid   Valid
1                           128             0         128
2                         1,920           ???         ???
3                        63,488           ???         ???
4                     1,048,576           ???         ???
Total:                1,114,112         2,048   1,112,064

If the number 2,048 currently in the article is correct, then this information should be uncontroversial. But we need to know how many of those 2,048 invalid code points belong to each number of bytes. Does someone have a reference with that information?

Ennex2 (talk) 01:44, 7 November 2018 (UTC)

The 2048 are all encoded using 3 bytes.
Some people think the noncharacters should not be counted either, and those are scattered all over the range.
I'm more concerned that bloating the table with all these extra columns is making UTF-8 look more complicated than it really is. I would like to get rid of the number of bits and "how many bytes", or at least move them *after* the ranges so it is clearer that the ranges are what identify the rows.

Spitzak (talk) 02:24, 7 November 2018 (UTC)
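
For reference, a sketch of the arithmetic behind those figures (the constants follow from the range each sequence length can encode):

    /* Code points reachable by each UTF-8 length, and the surrogate block
       (U+D800..U+DFFF) that is excluded; all 2,048 surrogates fall in the
       three-byte range. */
    enum {
        ONE_BYTE   = 0x000080 - 0x000000,  /*       128 */
        TWO_BYTE   = 0x000800 - 0x000080,  /*     1,920 */
        THREE_BYTE = 0x010000 - 0x000800,  /*    63,488 */
        FOUR_BYTE  = 0x110000 - 0x010000,  /* 1,048,576 */
        SURROGATES = 0x00E000 - 0x00D800   /*     2,048 */
    };
    /* Valid total: 128 + 1,920 + (63,488 - 2,048) + 1,048,576 = 1,112,064 */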

I see from the history of the page that you've been heavily involved in its development for years, whereas I'm a newcomer to the subject (though not a newcomer to computing). I'm glad you're working on clarifying things because I agree that it's pretty complicated. If there's a way to class code points in categories like "invalid", "noncharacters", "valid", and any others that may apply, that could help people understand what's going on. With such categories, if there's a way to summarize them by counting how many there are within each number of bytes, such as in a table like I've proposed, that could help to clarify things further. But I defer to you on when enough is known to present such information.

Ennex2 (talk) 02:45, 7 November 2018 (UTC)

"Mandatory"

The current article is too free with the term "mandatory". Yes, WHATWG calls it mandatory for WHATWG projects. It is not mandatory in a legal sense, or even a standards sense. The references to WHATWG sources need to make that clear. -- Elphion (talk) 20:03, 15 November 2018 (UTC)

UTF-1

The table recently added to describe UTF-1 is interesting, but it should go in UTF-1, not here. Here it suffices to say what the first paragraph says, together with the link to UTF-1. -- Elphion (talk) 15:24, 23 July 2016 (UTC)

I formatted the table as much as possible in the style of the others (which closely resemble the table in the FSS-UTF specification), specifically so that they could be considered together (for better appreciation of the history). It's not as if it's taking up that much space relative to the huge table further down in the article, and why would there not be a table to illustrate the text (or should that also have to be moved?)... I had indeed considered what you are pointing out, but that even seemed less preferable than (redundantly) having the table in both places. Perhaps let's wait for others' opinions about this? — RFST (talk) 07:26, 24 July 2016 (UTC)
Agree with Elphion -- this is much more info about UTF-1 than is needed in the UTF-8 article. It's also confusing that the first table in the UTF-8 article is not about UTF-8 at all... AnonMoos (talk) 07:44, 7 September 2016 (UTC)
UTF-1 is interesting trivia; I think we could rewrite it entirely out of this page (or almost), or maybe move the History section lower, if the MOS allows. This page is about UTF-8, not UTF-1; possibly [also] UTF-1 can be a small footnote in the Unicode article. -- 14:24, 7 September 2016 Comp.arch
I see no reason to delete the UTF-1 article, if that's what you're suggesting. We have tons of articles about obsolete encodings, and UTF-1 is historically an important step. -- Elphion (talk) 19:49, 7 September 2016 (UTC)

2017

I think that the part about UTF-1 is essential context, especially given the story about how UTF-8 was invented by one superhuman genius (in a diner, on a placemat...). How can you appreciate what problem was being solved, what the respective contributions were, without a juxtaposition of the various stages? If you don't care about what UTF-1 looked like, then you also don't really care about the history of UTF-8. Without it, you might as well delete the whole section. — RFST (talk) 14:25, 1 March 2017 (UTC)

It might not be out of place to briefly mention some aspects in which UTF-1 was lacking, in order to indicate the problems which UTF-8 was attempting to solve, but any detailed explication or analysis of UTF-1 would be out of place. AnonMoos (talk) 04:27, 2 March 2017 (UTC)
I believe Ken Thompson invented UTF-8. It was called UTF-2 back then. But UTF-1 found two very important concepts that Thompson had nothing to do with: multi-byte encoding and backward compatibility with ASCII. Off topic: the C programming language is actually just B with types, and B was created by Ken Thompson. Still we consider C to be Dennis Ritchie's invention, so obviously this is how things work. Coderet (talk) 13:40, 1 August 2017 (UTC)
I'm not sure what you mean by "found two very important concepts". UTF-1 was neither the first multi-byte encoding, nor the first encoding to have compatibility with ASCII; ISO 2022 predates it.--Prosfilaes (talk) 00:29, 16 November 2018 (UTC)

Citation needed on "truncate in middle of a character"

In the section "Comparison with single-byte encodings", a bullet point mentions an obvious fact:

"It is possible in UTF-8 (or any other multi-byte encoding) to split or truncate a string in the middle of a character", but this is flagged with [citation needed].

This is certainly true in languages such as C where strings are stored as an array of bytes. So I will add that proviso and remove the citation needed. — Preceding unsigned comment added by 138.207.235.63 (talkcontribs) 02:45, 4 August 2019 (UTC)

No, if the string is not an array of bytes, it is *NOT* UTF-8 (unless you are talking about some weird scheme where it is split into tiny 1-4 byte UTF-8 strings at the code points????). Therefore you have just added a lot of pointless bloat to the article, which I am going to revert.Spitzak (talk) 18:49, 4 August 2019 (UTC)

138.207. and @Spitzak: watch the WP: Signatures policy, both of you.
By the way, I removed from the article two instances of older applications that can… (or cannot…) gibberish. Please, write articles based on facts, not advocacy. Incnis Mrsi (talk) 07:19, 5 August 2019 (UTC)

UTF-8, the lead section, and my hidden agenda

Hi Ita140188, I see you've moved important information out of the lead, including the graph.

At a minimum, I find it important to have the graph above the fold, in the lead. Per MOS:LEAD: "It gives the basics in a nutshell and cultivates interest in reading on—though not by teasing the reader or hinting at what follows."

94.5% shows a clear majority, while showing that 5% do use something else (which could be 430 million people), without going into details; so "what are those 5% using?", people might think. There is space to go into more detail and show that the rest is very divided. If you do not know that, then you would be excused for thinking that 1 in 20 use some ONE other encoding.

The missing 5% is roughly equal to the populations of Russia, Japan, Egypt and Thailand combined, all with their good reasons to avoid Unicode, and for all we know all those countries (and no other) could be avoiding UTF-8 for their old legacy encodings not covering what Unicode/UTF-8 can cover (but all of those countries have high UTF-8 use, and no country has UTF-8 use much lower than 90% on the web).

I checked on mobile, and there you have to press to expand other sections. On my wide-screen desktop monitor, there is empty space that could well be filled with the graph.

Most people are not going to scroll past the lead. And I would argue, the MOST important information about an encoding is that it is used, or not used, and what are the alternatives.

You may not be aware of the UTF8 Everywhere Manifesto: "In particular, we believe that the very popular UTF-16 encoding (often mistakenly referred to as ‘widechar’ or simply ‘Unicode’ in the Windows world) has no place [except]". comp.arch (talk) 19:01, 6 May 2020 (UTC)

I think that is way too much detail for the lead, but ok. In any case, there must be a section detailing the use, since I think it is the most likely information people are looking for when opening the article. If they are like me, they would not read the lead but just jump to the relevant section from the table of contents. Also, the lead cannot have independent content: it should only be a summary of the article. This means any information in the lead should also be present elsewhere in the article. --Ita140188 (talk) 01:56, 7 May 2020 (UTC)
I made the section because some other editor duplicated the entire paragraph and inserted it in the middle of the article as a new section. I thought this might make them happy and avoid the duplicate. It does seem that this could be put back into the intro.Spitzak (talk) 02:18, 7 May 2020 (UTC)
I originally made the section, and my point is that it should stay. We can discuss what to add to the lead from that section. --Ita140188 (talk) 02:52, 7 May 2020 (UTC)

Byte order mark trivia

This article has seen significant work recently to try to elevate the important aspects of the subject and reduce the amount of coverage on trivia. One such change has been reverted on the grounds that "BOM killing usage of UTF-8 is very well documented". Of course the material in question has been unreferenced ever since it was added. I don't dispute that people using e.g. Windows Notepad in the middle of the decade were very annoyed by this, but it truly isn't an important enough aspect of the subject today to warrant its own subheading. All that we need to do is note that historically some software insisted on adding BOMs to UTF-8 files and that this caused interoperability issues, with a good reference. We currently lack the latter entirely, but we should at least restore the reduced version of the content such that we aren't inflating what is basically a historic bug that has no impact on the vast majority of uses of the spec. Chris Cunningham (user:thumperward) (talk) 17:08, 4 September 2020 (UTC)

Probably not very important nowadays, and continued watering-down by people trying to whitewash bad behavior by certain companies is making it unreadable. The problem was not actually programs adding the BOM, it was software that refused to recognize UTF-8 without the BOM, which *forced* software to write it and destroyed the ASCII compatibility, as well as basically introducing magic bytes to a file format that is intended to be the most generic text with no structure at all, and complicating even the most trivial operations such as concatenating files. I agree that an awful lot of software has been patched to ignore a leading BOM, and the only real bad result is the programming time wasted making these modifications. It actually appears that now there is an inverse problem: some Microsoft compilers work better with UTF-8 if the BOM is *missing*. The reason is that they leave the bytes in quoted string constants alone, while if the BOM is there they perform a translation to UTF-16 and back again, which introduces a lot of annoyances such as mangling any invalid byte sequences.
My main concern with the article here though was to move the description of the BOM out of the "description" section, since it is strongly discouraged by the Unicode consortium and a thing that should not exist has no right to be in the introductory description. It could be reduced a lot further. I also don't think there is much software that will show legacy letters any more.Spitzak (talk) 18:20, 4 September 2020 (UTC)
If you're accusing me of somehow having some pro-Microsoft agenda then I'd encourage you to go and have a walk or pet a dog or something. My only concern here is making the article as accessible as possible, which means minimising the amount of material in it which exists primarily to air editors' grudges against historic implementation bugs.
This material is still unsourced, and warrants a paragraph at best (and no subheader). It should be obvious to any reader that there is no actual need for a marker indicating byte order in a byte-oriented encoding, and without any (referenced!) context which shows this is a true and notable problem (as opposed to a historic quibble) the reader is left wondering why the hell such a big deal is being made of it. Chris Cunningham (user:thumperward) (talk) 18:30, 4 September 2020 (UTC)

Runes

The Google-developed programming language Go defines a datatype called rune. A rune is "an int32 containing a Unicode character of 1,2,3, or 4 bytes". It is not clear from the documentation whether a rune contains the Unicode character number (code point) or the usual UTF-8 encoding used in Go. Testing reveals that a rune appears to be the Unicode character number.

I found a good reference to confirm this at https://blog.golang.org/strings, so this information should be added prominently to this article and similar articles that are missing it. It can be quite frustrating to read about runes in Go and not have this information. David Spector (talk) 00:42, 4 September 2020 (UTC)
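
For illustration only (a C sketch, not Go's implementation): decoding a UTF-8 sequence yields the numeric code point, which is what a rune holds.

    #include <stdint.h>

    /* Decode one UTF-8 sequence at p into its code point value.  Returns the
       number of bytes consumed, or 0 on an invalid lead byte; checks for
       overlong forms, surrogates and truncation are omitted for brevity. */
    static int decode_rune(const unsigned char *p, uint32_t *cp) {
        if (p[0] < 0x80) { *cp = p[0]; return 1; }
        if ((p[0] & 0xE0) == 0xC0) {
            *cp = ((uint32_t)(p[0] & 0x1F) << 6) | (p[1] & 0x3F);
            return 2;
        }
        if ((p[0] & 0xF0) == 0xE0) {
            *cp = ((uint32_t)(p[0] & 0x0F) << 12) | ((uint32_t)(p[1] & 0x3F) << 6)
                | (p[2] & 0x3F);
            return 3;
        }
        if ((p[0] & 0xF8) == 0xF0) {
            *cp = ((uint32_t)(p[0] & 0x07) << 18) | ((uint32_t)(p[1] & 0x3F) << 12)
                | ((uint32_t)(p[2] & 0x3F) << 6) | (p[3] & 0x3F);
            return 4;
        }
        return 0;
    }
    /* The euro sign: bytes E2 82 AC decode to the rune value 0x20AC. */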

It sounds like that belongs in the page for Go, not here. This is about UTF-8, not datatypes specific to one language. Tarl N. (discuss) 01:26, 4 September 2020 (UTC)
Furthermore, this isn't a software reference manual. It shouldn't be added here at all, let alone "prominently", precisely because it is an obscure implementation feature of a relatively new programming language. Chris Cunningham (user:thumperward) (talk) 17:10, 4 September 2020 (UTC)
I believe the Plan9 documentation also called unicode code points "runes" so it might be relevant here, though really does not sound very important.Spitzak (talk) 18:12, 4 September 2020 (UTC)
Given the shared heritage of all three systems it's unsurprising that they share idiosyncrasies in nomenclature, but this is probably something more pertinent to the biographies of the creators than to the individual systems. Chris Cunningham (user:thumperward) (talk) 18:32, 4 September 2020 (UTC)