Talk:UTF-8/Archive 1


Reason for RFC3629 restricting sequence length

I've tried to find a good reason why RFC3629 restricts the allowed encoding area from 0x0-0x7FFFFFFF down to 0x0-0x10FFFF, but haven't found anything. It would be nice to have an explanation about why this was done. Was it only to be UTF-16 compatible? -- sunny256 2004-12-06 04:07Z

I haven't seen any specific reference to why in the specifications, but I think what you mention is the most likely reason. I do wonder if this restriction will last, though, or if they will have to find a way to add more code points at a later date (they originally thought that 65,536 code points would be plenty ;) ). Plugwash 10:31, 6 Dec 2004 (UTC)
The reason is indeed UTF-16 compatibility. Originally Unicode was meant to be fixed width. After the Unicode/ISO10646 harmonization, Unicode was meant to encode the basic multilingual plane only, and was equivalent to ISO's UCS-2. The ISO10646 repertoire was larger from the beginning, but no way of encoding it in 16 bit units was provided. After UTF-16 was defined and standardized, future character assignments were restricted to the first 16+ planes of the 31-bit ISO repertoire, which are reachable with UTF-16. Consequently security reasons prompted the abandonment of non-minimum length UTF-8 encodings, and around the same time valid codes were restricted to at most four bytes, instead of the original six capable of encoding the whole theoretical ISO repertoire. Decoy 14:05, 11 Apr 2005 (UTC)
The precise references for the above are [1] and [2]. The latter indicates that WG2 on the ISO side has formally recognized the restriction. Furthermore, [3] and [4] indicate that the US has formally requested WG2 to commit to never allocating anything beyond the UTF-16 barrier in ISO10646, and that the restriction will indeed be included in a future addendum to the ISO standard. Decoy 16:51, 19 Apr 2005 (UTC)
For the avoidance of confusion, I suggest that the main UTF-8 article should have an extra paragraph indicating that the encoding allows for six octets in theory, but that no more than four are ever permitted. References such as [5] will then be less likely to cause uncertainty. —Preceding unsigned comment added by RickBeton (talkcontribs) 13:57, 11 October 2008 (UTC)
It already does:
"By continuing the pattern given above it is possible to deal with much larger numbers. The original specification allowed for sequences of up to six bytes covering numbers up to 31 bits (the original limit of the universal character set). However, UTF-8 was restricted by RFC 3629 to use only the area covered by the formal Unicode definition, U+0000 to U+10FFFF, in November 2003." Plugwash (talk) 15:40, 11 October 2008 (UTC)
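For readers who want to check the four-byte limit concretely, here is a minimal Python sketch of the length rules discussed above (the helper name `utf8_length` is illustrative, not from any standard):

```python
# Illustrative sketch: UTF-8 sequence lengths under the RFC 3629 limits.
def utf8_length(cp: int) -> int:
    """Number of bytes RFC 3629 UTF-8 uses for a code point."""
    if cp < 0:
        raise ValueError("negative code point")
    if cp <= 0x7F:
        return 1
    if cp <= 0x7FF:
        return 2
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("surrogates are not encodable")
    if cp <= 0xFFFF:
        return 3
    if cp <= 0x10FFFF:
        return 4
    # Pre-RFC 3629, 5- and 6-byte forms covered values up to U+7FFFFFFF.
    raise ValueError("beyond U+10FFFF: forbidden since RFC 3629")

print(utf8_length(0x10FFFF))  # 4
```

The result agrees with what a strict encoder produces, e.g. `len(chr(0x10FFFF).encode("utf-8"))` is also 4.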

invalid UTF-8 binary streams

In the description, the table indicates that 1100000x (C0, C1) are invalid binary strings because the value (xyz) is less than 127. Is this really correct? Don't the two-byte character points start at 0x80? Also, while I think it's useful to point out the pre-RFC-3629 validation, combining them all in the same table may be confusing. 1111110x was valid in 2002 but invalid in 2004. If my understanding is correct (quite possibly not), the table might be more easily understood as:

Codes (binary) Notes
0xxxxxxx valid
10xxxxxx valid
110xxxxx valid
1110xxxx valid
1111010x valid
1111011x invalid
11111xxx invalid

Perhaps the previous method presented the hex values more clearly (personally, I find the binary notation essential to understanding UTF-8). Alexgenaud 2007-03-06

Yes, C0 and C1 are invalid because they would be the lead bytes of an overlong form of a character from 0 to 63 or 64 to 127 respectively. As for the other codes that are now forbidden, it is important to remember this is essentially a change from being reserved for future expansion to being forbidden; that is, these values should not have been in use anyway. Plugwash 16:40, 6 March 2007 (UTC)
Thanks. RFC 3629 doesn't define 'overlong'. The single byte contributes 7 bits to the character point, while two bytes contribute 11. If the leading 4 bits of the 11 are zero, then we are left with 7 bits. The stream octets are still unique, but the character point would be redundant (though the RFC implies this will naively or magically become U+0000 for reasons beyond me). I suppose there must be other similar issues: 11100000 1000000x 10xxxxxx, 11100000 100xxxxx 10xxxxxx, and perhaps three others in the four-octet range. Alexgenaud 2007-03-06 (octets separated by plugwash)
Hope you don't mind me spacing out the octets in your last post. There are invalid combinations in the higher ranges, but no more byte values that are always invalid. RFC 3629 doesn't define overlong, but it does use the term in one place and talks about decoding of overlong forms by naive decoders. It does state clearly that they are invalid, though it does not use the term when doing so ("It is important to note that the rows of the table are mutually exclusive, i.e., there is only one valid way to encode a given character."). Plugwash 13:24, 8 March 2007 (UTC)
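The overlong-form argument above can be checked mechanically. A small Python sketch (the helper name is made up for this example):

```python
# Why lead bytes C0/C1 are always invalid: any two-byte sequence they start
# decodes to a value <= 0x7F, which already has a one-byte form ("overlong").
def decode_two_byte(b1: int, b2: int) -> int:
    assert 0xC0 <= b1 <= 0xDF and 0x80 <= b2 <= 0xBF
    return ((b1 & 0x1F) << 6) | (b2 & 0x3F)

# 0xC1 0xBF would decode to 0x7F -- the overlong form of ASCII DEL.
print(hex(decode_two_byte(0xC1, 0xBF)))  # 0x7f

# A strict decoder such as Python's rejects the sequence outright:
try:
    bytes([0xC1, 0xBF]).decode("utf-8")
except UnicodeDecodeError:
    print("rejected as invalid UTF-8")
```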

decimal VS hex

The article currently uses a mixture of decimal and hex with no clear labeling as to which is which. Unicode specs seem to prefer hex; HTML entities and the Windows entry method default to decimal. Ideas? Plugwash 02:19, 26 Jan 2005 (UTC)

Unicode characters are usually given in hexadecimal (e.g., U+03BC). But numeric character references in HTML are better in decimal (e.g., &#956; rather than &#x3BC;), because this works on a wider range of browsers. So I think we should stick to hexadecimal except for HTML numeric character references. --Zundark 09:15, 26 Jan 2005 (UTC)
I agree, and would add that when referring to UTF-8 sequence values / bit patterns / fixed-bit-width numeric values, it is more common to use hexadecimal values, either preceded by "0x" (C notation) or followed by " hex" or " (hex)". I prefer the C notation. I doubt there are very many people in the audience for this article who are unfamiliar with it. For example, bit pattern 11111111 is written as 0xFF. -- mjb 18:25, 26 Jan 2005 (UTC)
Could someone add a paragraph to the main text about decimal or hex numeric character references to explain why they are necessary? —The preceding unsigned comment was added by 203.218.79.170 (talkcontribs) 2005-03-25 01:19:52 (UTC).
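For illustration, the two reference styles discussed above differ only in how the code point number is written (the helper names below are made up for this sketch):

```python
# Decimal vs. hexadecimal numeric character references for a code point,
# shown here for GREEK SMALL LETTER MU (U+03BC).
def ncr_decimal(cp: int) -> str:
    return f"&#{cp};"

def ncr_hex(cp: int) -> str:
    return f"&#x{cp:X};"

print(ncr_decimal(0x03BC))  # &#956;
print(ncr_hex(0x03BC))      # &#x3BC;
```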

Appropriate image?

Is Image:Lack of support for Korean script.png an appropriate image for the "invalid input" section? I am wondering if those squares with numbers are specifically UTF-8? --Oldak Quill 06:26, 23 July 2005 (UTC)

No, the hexadecimal numbers in the boxes are Unicode code points for which the computer has no glyph to display. They don't indicate the encoding of the page. The encoding could be UTF-16, UTF-8, or a legacy encoding. Indefatigable 14:47, 23 July 2005 (UTC)

Comment about UTF-8

The article reads: "It uses groups of bytes to represent the Unicode standard for the alphabets of many of the world's languages. UTF-8 is especially useful for transmission over 8-bit Electronic Mail systems."

"UTF-8 was invented by Ken Thompson on September 2, 1992, on a placemat in a New Jersey diner with Rob Pike. The day after, Pike and Thompson implemented it and updated their Plan 9 operating system to use it throughout."

RANTMODE: ON

I would like to add that UTF-8 is causing havoc and confusion to computer users that are used to the international characters in at least the ISO-8859-1 charset. Getting national chars through is a hit or miss thing these days. Some people can read UTF-8, but not write it. Some people write it but can't read it, some people can do neither, and some people do both.

UTF-8, in my opinion, sucks. I've used months to finally figure out how to get rid of the abomination. I'm finally back to ISO-8859-1 on all my own systems. I *hate* modern OS-distributions defaulting to UTF-8, and I've got a dream about meeting the bastard that invented the abomination to tell him what I really think of it.

RANTMODE: OFF (unsigned comment by 80.111.231.66)

In my opinion UTF-8 is a wonderful encoding and I hope that it will be widely accepted as soon as possible. For example, the Czech language may use the CP852 encoding, the ISO8859-2 encoding, the Kamenicky encoding, the Windows-1250 encoding, theoretically even KOI-8CS. Every stupid producer of a character LCD display creates his own "encoding". It doesn't work in DVD players, it creates problems on web pages and in computers; you cannot even imagine how much it sucks. UTF-8 could become THE STANDARD we are looking for, because it is definitely the best standardized Unicode encoding and it is widely accepted (even Notepad in WinXP recognizes it, so we can say that it has become a de facto standard). What really makes one angry is the "Java UTF-?", "Mac UTF-?", the byte-order-dependent UTF-16 "encoding" and so on. Everybody tries to make his own "proprietary extensions", and the main reason is to keep incompatibility and to kill the interoperability of software, because some people think it to be "good for business". So PLEASE, ACCEPT UTF-8; it really is the best character encoding nowadays. (2006-08-17 20:38)
Working with systems that use different encodings and provide no good way to specify which one is in use is going to be painful, PERIOD. The old system of different charsets for different languages can be just as horrible. Plugwash 16:37, 13 August 2005 (UTC)
The point is that most people don't need to work with systems with different language encodings. While almost everyone is affected by the conflicts between UTF-8 and their local charset. When you're not used to that kind of bullshit, it gets hugely annoying. Personally I've just spent the day de-utf8ying my systems. (80.111.231.66)
Point taken, but you should know that we feel the same way when dealing with your "code pages" and other limited charsets. Some time in the future, we might say the same thing about UTF-8, though at least they've left us two extra bytes to add characters to. Rōnin 22:48, 15 October 2005 (UTC)
Wikipedia is utf-8. Anyone who thinks that Wikipedia would have been as easy to manage without one unified character set is smoking something. 'nuff said. --Alvestrand 00:15, 18 August 2006 (UTC)

ISO-8859-1 has no € (euro sign) so it is no longer useful in a world where the euro is the most powerful currency. 84.59.194.36 (talk) 18:43, 27 August 2008 (UTC)

Never mind, U+20AC works fine. You can code it as &#x20AC;, looking like: € 1.23. -- 20:55, 27 August 2008 User:Woodstone
"ISO-8859-1 has no € (euro sign) so it is no longer useful in a world where the euro is the most powerful currency." That is true, but AFAICT, when most internet software says ISO-8859-1 it means windows-1252, which does have the euro sign. Plugwash (talk) 23:42, 27 August 2008 (UTC)
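The difference between the two charsets can be seen with a quick check against a codec library such as Python's (an illustration, not part of the discussion above):

```python
# The euro sign exists in windows-1252 but not in ISO-8859-1.
euro = "\u20ac"
print(euro.encode("cp1252"))   # b'\x80'
try:
    euro.encode("latin-1")     # latin-1 == ISO-8859-1
except UnicodeEncodeError:
    print("ISO-8859-1 cannot encode the euro sign")
print(euro.encode("utf-8"))    # b'\xe2\x82\xac'
```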

The comparison table

Is the table wrong for the row showing "000800–00FFFF"? I don't know much about UTF-8 but to me it should read "1110xxxx 10xxxxxx 10xxxxxx" not "1110xxxx xxxxxx xxxxxx" --220.240.5.113 04:30, 14 September 2005 (UTC)

  • Fixed. It got lost in some recent table reformatting. Bovlb 05:44, 14 September 2005 (UTC)

bom

The following was recently added to the article; I reverted and was re-reverted ;).

  • UTF-16 needs a Byte Order Mark (U+FEFF) in the beginning of the stream to identify the byte order. This is not necessary in UTF-8, as the sequences always start with the most significant byte on every platform.
UTF-16 only needs a BOM if you don't know the endianness ahead of time because the protocol was badly designed not to specify it, but apps that don't have an out-of-band way of specifying the charset probably aren't going to be very suitable for use with UTF-16 anyway. Apart from plain text files, I don't see many situations where a BOM would be needed, and it should be noted that with plain UTF-8 text, at least MS uses a BOM anyway. In summary, while I can see where you are coming from, I'm not sure how relevant this is given what's already been said. Plugwash 22:39, 6 October 2005 (UTC)
It was recently pointed out to me that using a UTF-8 BOM is the only way to force web browsers to interpret a document as UTF-8 when they've been set to ignore declared encodings. I haven't confirmed this, myself, but it sounds plausible. — mjb 00:15, 19 October 2005 (UTC)

I know the basic cycle of when web browsers recognized the UTF-8 BOM, but have not heard of this forcing behaviour outside of Windows. It's true that all major web browsers now accept UTF-8 with or without a BOM. There were times when sending UTF-8 with a BOM would stop HTML web browsers, and now Web3D X3D browsers are arriving with that same failure. All VRML and X3D Classic VRML encodings have always specified UTF-8 with # as the first character for delivery and storage. However, the appearance of the apparently 'optional' BOM has caught a few of the old-school UTF-8 tools out, since requiring the first character to be # effectively makes the UTF-8 BOM illegal. Are there other examples where the apparently 'optional' UTF-8 BOM has generated problems and solutions that can be documented here? Thanks and Best Regards, JoeJoedwill 08:04, 12 November 2007 (UTC)

Well, it also plays havoc with Unix, which expects shell scripts to start with #!. —Preceding unsigned comment added by Spitzak (talkcontribs) 20:36, 12 March 2008 (UTC)
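The #! clash described above is easy to demonstrate (a Python sketch using the standard `utf-8-sig` codec, which writes and strips the BOM):

```python
# The UTF-8 BOM is U+FEFF encoded as EF BB BF.
text = "#!/bin/sh\n"
with_bom = text.encode("utf-8-sig")
print(with_bom[:3].hex())          # efbbbf
# A shell looking for '#!' as the very first two bytes would now fail:
print(with_bom.startswith(b"#!"))  # False
# Decoding with utf-8-sig strips the BOM again:
print(with_bom.decode("utf-8-sig") == text)  # True
```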

Double UTF-8

I recently ran into a problem with a double-UTF8 encoded document. Doing a search on the internet for double UTF-8 I see this isn't an uncommon problem. If a BOM exists in double-UTF8 encoded text, it appears as "C3 AF C2 BB C2 BF". If there were some mention of this problem and how to cope with it here, I could have saved some headaches trying to figure out what was going on. I think this problem should be addressed somehow in the article. Boardhead 16:46, 11 April 2007 (UTC)
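The byte sequence quoted above can be reproduced, and undone, like this (a Python sketch assuming the intermediate misinterpretation was ISO-8859-1, which is the usual cause of double encoding):

```python
# Double encoding: UTF-8 bytes mistakenly treated as ISO-8859-1 text and
# re-encoded as UTF-8. The BOM EF BB BF becomes C3 AF C2 BB C2 BF.
bom = "\ufeff".encode("utf-8")                 # b'\xef\xbb\xbf'
double = bom.decode("latin-1").encode("utf-8")
print(double.hex(" "))                         # c3 af c2 bb c2 bf
# Undoing the damage reverses the two steps:
print(double.decode("utf-8").encode("latin-1") == bom)  # True
```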

xyz

I see that there are now not only x'es, but also y's and z's in the table. While it was difficult to understand before, it's now starting to be more confusing than informative. Could we possibly revert to only using x? Rōnin 19:38, 28 November 2005 (UTC)


I can see xyz in the table, but description in the table looks like "three x, eight x" for "00000zzz zxxxxxxx" UTF-16 entry, and in other places of the table - just x, no y nor z. —The preceding unsigned comment was added by 81.210.107.159 (talkcontribs) 2006-04-18 21:46:20 (UTC).

I for one am more confused after reading the table than before reading it. I used to think I had at least a vague idea of how it worked, but I'm back to square zero now. If things could be rephrased it might be easier to understand. -- magetoo 19:40, 21 May 2006 (UTC)

I replaced the z's in the first three rows of the table with x's, since the "z>0000" note was redundant with the hexadecimal ranges column. I couldn't decipher the fourth row of the table, so I left it alone. IMHO the table would make a lot more sense if the UTF-16 column was removed. -- surturz 06:15, 14 June 2006 (UTC)

I've removed the UTF-16 column and y's and z's. It now uses only x (and 0) for a bit marker. I think it is much clearer now, I hope everyone agrees. Surturz 06:43, 14 June 2006 (UTC)

Someone has put x's y's and z's back in the table, but in a much better manner. It looks good now. --Surturz 21:35, 7 October 2006 (UTC)

"This is an example of the NFD variant of text normalization"

did you actually compare the tables to check this was the case before adding the comment? —Preceding unsigned comment added by Plugwash (talkcontribs)

no - the week before, I talked to the president of the Unicode Consortium, Mark Davis, and asked him whether there were any products that used NFD in practice. He said that MacOS X was the chief example of use of NFD in real products. --Alvestrand 21:08, 19 April 2006 (UTC)


conversion

is there any conversion tool from and to UTF-8 (e.g. U+91D1 <-> UTF-8 E98791)? Doing the binary acrobatics in my head takes too long. dab () 12:36, 24 July 2006 (UTC)

I find the bin-to-hex and hex-to-bin conversions much easier, but that might be because I worked with chips; a matter of practice. Maybe a small paper showing at least the first 16 hex digits and their binary equivalents, or a text file with a link on the desktop, is another alternative, but a tool for that should perform better. ~Tarikash 05:41, 25 July 2006 (UTC).

um, I am familiar with hex<->bin, but honestly, how long does it take you to figure out 91D1->E98791? I suppose with some practice, it can be done in 20 seconds? 10 seconds if you're good? Now imagine you need to do dozens of these conversions? Will you do them all in your head, with a multiple-GHz machine sitting right in front of you? I'm all for mental agility, but that sounds silly. dab () 18:09, 29 July 2006 (UTC)

Considering that UTF-8 was designed to be trivial for computers to encode and decode, it really shouldn't take any programmer who understands bit operations more than a few minutes to code something like this from scratch. Even less to code it around an existing conversion library they were familiar with.
Why exactly are you doing dozens of these conversions in the first place? At those kinds of numbers I'd be thinking of automating the whole task, not using an app I had to manually paste numbers into. Plugwash 22:44, 29 July 2006 (UTC)
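For what it's worth, the specific conversion asked about at the top of this thread can be done with a couple of Python one-liners:

```python
# U+91D1 -> UTF-8 bytes E9 87 91, and back again.
print(chr(0x91D1).encode("utf-8").hex().upper())          # E98791
print(hex(ord(bytes.fromhex("E98791").decode("utf-8"))))  # 0x91d1
```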

Clarify intro?

I had to read all the way down into the table in order to be sure we were talking about a binary format rather than a text format here. Is there a way of making the first two sentences clearer? By "binary format" I mean that characters apparently use the full range of ASCII characters, rather than just the 36 alphanumeric ones. Stevage 06:34, 16 August 2006 (UTC)

Odd definition of text format you have there (BTW, ASCII has 95 printable characters). Yes, it does use byte values that aren't valid ASCII to represent non-ASCII characters, just like virtually every text encoding used today (cp850, windows-1252, ISO-8859-1, etc.) does. Also note that UTF-8 is not a file format in its own right; you can have a DOS-style UTF-8 text file or a Unix-style UTF-8 text file, or a file that uses UTF-8 for strings but a non-text-based structure. Plugwash 17:02, 16 August 2006 (UTC)
Ok, having read character encoding I understand better. Not sure how the intro to this article could be improved. Stevage 19:59, 16 August 2006 (UTC)
Yeah, the bit "the initial encoding of byte codes and character assignments for UTF-8 is coincident with ASCII (requiring little or no change for software that handles ASCII but preserves other values)" is either confusing, not English, or not correct; I'm not sure which. Is this better?: "a byte in the ASCII range (i.e. 0–127, a.k.a. 0x00 to 0x7F) represents itself in UTF-8. (This partial backwards compatibility means that little or no change is required for software that handles ASCII but preserves other values to also support UTF-8.)" It needs to be explained how it differs from punycode. --Elvey 16:36, 29 November 2006 (UTC)

Need reference for email client support

Of all the major email clients, only Eudora does not support UTF-8.

This needs a reference. Which are the major email clients?! 199.172.169.7 18:33, 17 January 2007 (UTC)

I've removed it, since it's gone unchecked for about three months. -- Mikeblas 14:30, 22 April 2007 (UTC)

stroke vs caps

Hi, I am trying to translate this page in Italian for it.wikipedia, and was wondering if someone could elaborate a bit about the part on stroke versus capital letters. --Agnul 21:42, 2 Dec 2004 (UTC)

(moved new subject to the bottom, per convention)
Since the word "stroke" doesn't appear in the article, I'm at a loss to identify what you mean. Could you clarify? --Alvestrand 22:54, 21 January 2007 (UTC)

Recent change in Advantages and Disadvantages Section

I'm pretty new to editing in Wikipedia, so I wanted to consult with some more experienced Wikipedians before messing with someone else's recent change.

An anonymous user just added this text under the Disadvantages compared to UTF-16 section:

For most Wester European languages, the strictly 8 bit character codes from the ISO_8859 standard work fine with current applications, which can be broken when using multibyte codes.

I see at least three problems with this change. First, there is at least one typo. Second, this is miscategorized under the UTF-16 comparison section, since it is comparing UTF-8 to legacy encodings. Thirdly, this sounds like an unsupported assertion coming from someone who doesn't like UTF-8. I didn't want to fix the first two problems without addressing the third.

Is there any way this addition can be salvaged by moving it under legacy encodings and editing to be more appropriate? Perhaps to talk about support for legacy encodings vs. UTF-8? I'm not even sure that the assertion in this item is true. In my personal experience, UTF-8 support in current operating systems, applications, and network protocols seems to be very good, and applications that don't support UTF-8 seem to be more of the exception than the common case. —The preceding unsigned comment was added by Scott Roy Atwood (talkcontribs) 21:21, 6 February 2007 (UTC).

Asian characters in UTF-8 versus other encodings

The article states that "As a result, text in Chinese, Japanese or Hindi takes up more space when represented in UTF-8." This is not strictly true of Japanese text: the Japanese often use the character encoding ISO-2022-JP to communicate on the Internet, which is a 7-bit encoding in which characters take up more space than in UTF-8. For the Shift-JIS encoding used in Windows, however, it does hold true. Rōnin 01:44, 9 February 2007 (UTC)

The relationship between the size of UTF-8 encoded text and ISO-2022-JP encoded text is somewhat more complicated than for some of the other Asian encodings, but it is still generally true that UTF-8 encoding is larger. It is true that it is limited to 7-bit ASCII characters, but it uses a three byte escape sequence to begin and end a run of Japanese characters, and within the escaped text, each Japanese character takes two bytes. So that means ISO-2022-JP text will take fewer bytes than UTF-8 as long as the run of text is longer than six Japanese characters. Scott Roy Atwood 19:08, 9 February 2007 (UTC)
Ah... Thanks so much for the explanation! Rōnin 22:05, 14 February 2007 (UTC)
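The arithmetic in the reply above can be checked directly (a Python sketch; the six-character string is an arbitrary example, and Python's `iso2022_jp` codec emits the three-byte escape sequences on each side of the Japanese run):

```python
# ISO-2022-JP brackets a Japanese run with escape sequences (3 bytes each
# side) but uses 2 bytes per character, vs 3 bytes per character in UTF-8.
run = "日本語テスト"  # six Japanese characters
print(len(run.encode("iso2022_jp")))  # 6*2 + 2*3 = 18 bytes
print(len(run.encode("utf-8")))       # 6*3 = 18 bytes
# At seven characters, ISO-2022-JP pulls ahead:
run7 = run + "日"
print(len(run7.encode("iso2022_jp")), len(run7.encode("utf-8")))  # 20 21
```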

Java modified UTF-8

"Because modified UTF-8 is not UTF-8, one needs to be very careful to avoid mislabelling data in modified UTF-8 as UTF-8 when interchanging information over the Internet."

It's usually safe to treat modified UTF-8 simply as UTF-8, so this "very careful to avoid mislabelling" is exaggerating.

Not really. The treatment of code point 0 and of code points beyond 65535 is not compatible. MegaHasher 04:23, 9 July 2007 (UTC)
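A sketch of those two incompatibilities, assuming the behaviour commonly described for Java's modified UTF-8 (the function name is illustrative, not an API):

```python
# Modified UTF-8 differs from UTF-8 in exactly two ways: U+0000 is stored
# as the overlong pair C0 80, and code points above U+FFFF are stored as a
# CESU-8-style surrogate pair (6 bytes instead of 4).
def modified_utf8(s: str) -> bytes:
    out = bytearray()
    for ch in s:
        cp = ord(ch)
        if cp == 0:
            out += b"\xc0\x80"          # overlong NUL
        elif cp <= 0xFFFF:
            out += ch.encode("utf-8")   # same as standard UTF-8
        else:                           # split into a UTF-16 surrogate pair
            cp -= 0x10000
            hi, lo = 0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF)
            for scp in (hi, lo):        # encode each surrogate as 3 bytes
                out += bytes([0xE0 | (scp >> 12),
                              0x80 | ((scp >> 6) & 0x3F),
                              0x80 | (scp & 0x3F)])
    return bytes(out)

print(modified_utf8("\x00").hex())       # c080
print(len(modified_utf8("\U0001F600")))  # 6, vs 4 in standard UTF-8
```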

"Legacy" vs "Other" encodings

De Boyne and Imroy - can you please give your arguments here rather than flipping the article back and forth between "legacy" and "other" encodings? I have an opinion, but want to hear both your arguments for why you think it's obvious enough to make these changes without discussion. --Alvestrand 17:33, 16 October 2007 (UTC)

At the risk of sounding juvenile: de Boyne started it. It doesn't appear to be a technical matter at all, but a grammatical one. He seems to have a pet peeve about the (mis)use of the word 'legacy' and has even written a page about it[6] on his ISP's web host. I've pointed out to him that 'legacy' is being used as an adjective in "legacy encoding", even showing that the OED says it's OK. I thought my arguments had settled the matter last month, but then he came back yesterday and restarted his efforts to 'fix' this article. He's now claiming original research, saying we made up the term or something...
I've been trying to argue the matter with him on his talk page and am currently waiting on a reply. --Imroy 19:08, 16 October 2007 (UTC)
I agree with the above. What I don't understand is where JdBP got the idea that "legacy" is pejorative in the first place. "Legacy" in computer science has always been a neutral descriptive term like "naïve" in mathematics and philosophy (naïve set theory, etc.). It's true that many non-mathematicians misinterpret this use of "naïve" as an insult, but the solution is to educate those people, not to switch to a less precise term.
The other problem is JdBP's antisocial behavior. He obviously should have started a discussion here after the first revert; instead he just kept making the edit over and over. If I'd been in Imroy's shoes I would have reverted those edits on general principle regardless of the content. -- BenRG 02:03, 17 October 2007 (UTC)
This other user is beside the point. The question here is whether the article should be using the term "legacy encoding" to mean "non-Unicode encoding". It's not that the term "legacy" is pejorative, it's just that it expresses a point of view. As such, it's inappropriate for a supposedly neutral encyclopedia article. --Zundark 10:15, 17 October 2007 (UTC)
The common non-Unicode encodings were introduced by IBM, MS, and ISO. AFAICT, all of those have moved on to Unicode now and are keeping their old encodings around only for historical/compatibility reasons. Do you have any evidence to the contrary? Plugwash 11:09, 19 October 2007 (UTC)
I don't have any meaningful evidence either for or against your hypothesis. But I don't think your question is very relevant anyway, as there are significant character encodings that are not the product of any of the organizations you mention. Do you have any evidence that a character encoding as recent as KZ-1048 is regarded by its creators as a legacy encoding? Would it be NPOV for us to call it such? --Zundark 13:34, 19 October 2007 (UTC)

Better rationale for the introductory comment on higher plane characters being unusual

In the lede, the encoded length of higher-plane characters is excused by noting that they are rare. This is an excellent rationale, but the proper context for this sort of reasoning is not spelled out: it is a standard result in text compression that giving rare symbols longer encodings and common ones shorter encodings reduces the average length of the text as a whole. Because of that, I think a link to at least one relevant article on statistical data compression should be included in the introduction, presuming the above-mentioned rationale stays.

That might also help us tighten the (already quite bloated) introduction a bit, and make it more encyclopedic. It might also be that such reasons should be factored out from here and into the main article on Unicode/ISO 10646, because they tend to impact all encodings more or less equally. After all, they are always applied in the context of the design of the parent coded character set, and not of the encodings.

Decoy (talk) 22:39, 20 December 2007 (UTC)

Code range values in table within Description section

Why list two ranges "000800–00D7FF" and "00E000–00FFFD" for the 3-byte code?

Why not "000800–00FFFF"?

The range "00E000–00FFFD" followed by the range "010000–10FFFF" leaves a gap -- it doesn't account for 00FFFE and 00FFFF. --Ac44ck (talk) 21:41, 2 January 2008 (UTC)

The note in the middle of the table for the two-byte code says: "byte values C2–DF and 80–BF". Wouldn't the first byte be in the range "C0–DF"? --Ac44ck (talk) 23:36, 2 January 2008 (UTC)
C0 and C1 are invalid in UTF-8 because they would always code for overlong sequences. Plugwash (talk) 00:54, 3 January 2008 (UTC)
Aha. The UTF-8 code splits to occupy two bytes as the number of the character being encoded increments from 0x7F to 0x80. Bit 6 of the one-byte form moves to become bit 0 in the most-significant byte of the two-byte form — so the MSB can't be C0. And then incrementing the value represented by the distributed bits (which are all set) forces a "carry" to set bit 1 and clear bit 0 of the MSB — and the MSB can't be C1. Thus, "c2 80" is the first valid two-byte form:
Unicode → UTF-8
U+007F → 7f
U+0080 → c2 80
Thanks. —Ac44ck (talk) 02:43, 3 January 2008 (UTC)
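The boundary worked out above can be confirmed quickly with any strict encoder:

```python
# U+007F is the last one-byte code; U+0080 is the first two-byte sequence,
# and its lead byte is C2 -- which is why C0 and C1 never appear.
print(chr(0x7F).encode("utf-8").hex())  # 7f
print(chr(0x80).encode("utf-8").hex())  # c280
```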

Consolidated and extended the input range for three-byte codes

The top end of the "code range" for three-byte UTF-8 was listed as FFFD – omitting FFFE and FFFF.

The character code FFFF has a three-byte UTF-8 code: EF BF BF.

Perhaps FFFE was omitted because it can be a Byte Order Mark (BOM). Some web pages note that the BOM is not a valid UTF-8 code (no value encodes to "FF FE"). Nor is FFFF a valid UTF-8 code. But these facts are unrelated to which values may be encoded.

I see no reason to exclude FFFE or FFFF from the code range, nor to break the range 0000-FFFF into two sub-ranges. -Ac44ck (talk) 18:44, 4 January 2008 (UTC)

The BOM is U+FEFF, not U+FFFE. Noncharacters like U+FFFE do have UTF-8 encodings, according to Table 3-7 of Unicode 5.0, and they should be included in the range. Surrogates do not, and they should be excluded from the range. ED A0 80 is not a valid UTF-8 byte sequence; in no sense whatsoever does it decode to U+D800. -- BenRG (talk) 22:10, 4 January 2008 (UTC)
This seems like overreaching to me:
Page 41 of 64 in pdf file; Labeled as page 103 in the visible text:
Because surrogate code points are not Unicode scalar values, any UTF-8 byte sequence that would otherwise map to code points D800..DFFF is ill-formed.
Unlike a leading byte of C0 for a two-byte code, there is nothing inherent in the UTF-8 structure which prevents (or creates any ambiguity when) encoding "D800" as "ED A0 80".
A screwdriver isn't designed to be used as a chisel. That doesn't stop me from chiseling with a screwdriver sometimes. Likewise, data-encoding mechanisms may be used for other than their intended purposes.
http://en.wikipedia.org/wiki/UTF-8
  • Java ... modified UTF-8 for ... embedding constants in class files.
Might one of those constants be in the range D800..DFFF?
The article also notes:
  • data compression is not one of Unicode's aims
But one might want to use UTF-8 for that purpose. A value in the disallowed range might be part of the data being compressed.
The meaning of data in the range D800..DFFF is restricted only by a Unicode convention, but there seems to be no functional problem with encoding any value in that range as UTF-8.
What is allowed by Unicode and what is doable in UTF-8 seem like separate issues to me.
I added a note to the table to assist others who may be as surprised by the exclusion from the range and unfamiliar with this restriction by Unicode as I was.
http://en.wikipedia.org/wiki/Byte_Order_Mark
Representations of byte order marks by encoding
  • UTF-16 Little Endian FF FE
-Ac44ck (talk) 05:13, 5 January 2008 (UTC)
UTF-8 (as defined in Unicode 5.0 and RFC 3629) has no encodings for surrogates. It's impossible by definition to encode a surrogate in UTF-8. I'm not trying to be dogmatic, I just want the information in the article to be correct. What you're talking about is a variable-length integer coding scheme that's used by CESU-8, UTF-8, and some other serialization formats, some of which are not even character encodings. That scheme is not called UTF-8 as far as any standards body is concerned, but some people do call it UTF-8, and maybe the article should mention that.
The byte sequence FF FE and the code point U+FFFE are different things. The BOM is U+FEFF, and it has many encodings, one of which is the byte sequence FF FE. -- BenRG (talk) 01:37, 6 January 2008 (UTC)
I hear you. It is important to make it clear that the range D800..DFFF is disallowed by Unicode. A parser which is built to decode a UTF-8 stream to Unicode code points is broken if it doesn't complain about receiving data which resolves to a surrogate.
But the Unicode standard isn't a natural law. It is not "impossible" for values in the D800..DFFF to be transformed by the UTF-8 algorithm. The table at http://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt (linked in the article) suggests that the range D800..DFFF wasn't excluded from the original implementations of UTF-8. But the range D800..DFFF is prohibited by Unicode these days.
This was also helpful: "The byte sequence FF FE and the code point U+FFFE are different things." I wasn't making the distinction. Whether a value is transmitted as little- or big-endian doesn't affect the magnitude of that value; and a "code point" is a value, not merely a byte sequence.
Thanks.-Ac44ck (talk) 19:45, 6 January 2008 (UTC)

Why length matters

I reverted a change that seemed to argue that the only place size really matters is in database design. I don't agree - I think text storage and forms handling are cases where size (and in particular knowing size ahead of time) also matter. I kind of agree that size doesn't matter that *much* any more, but didn't see a reason to agree with the database focus. Feel free to argue with me. --Alvestrand (talk) 10:08, 21 January 2008 (UTC)


Stop spamming

I just removed a comment made by a certain software vendor regarding a certain file format that happens to have (broken) support for UTF-8. It is thus completely irrelevant. 90.163.53.5 (talk) 19:32, 19 February 2008 (UTC)

Ok, that paragraph in the 'history' section was misplaced and perhaps not really relevant to UTF-8. But how was it "spamming", and why do you say it was a "comment made by [a] certain Software vendor"? I don't follow. --Imroy (talk) 09:12, 20 February 2008 (UTC)

Maybe the paragraph about how APIs work is unnecessary

There is a flurry of editing about how Win32/Windows/WinNT/etc. work.

I am not clear about the distinctions between the various Windows-sounding names. The extent to which I need to be concerned about the distinctions between them may suggest a flaw (and/or a lack of _user_-focus) in the development philosophy being inflicted upon the industry.

How a vendor persists in creating a moving target is secondary to the issue _that_ they persist in creating a moving target.

The article is about UTF-8.

A vendor whose operating system:

  • 1. Requires gigabytes of memory as overhead, but
  • 2. Is not designed to run 16-bit applications

may have a questionable commitment to "maintaining compatibility with existing programs" anyway. If a change in how they implement an operating system creates duplicate work for them, maybe it isn't all bad for them to feel a little of the pain that they inflict upon others.

It seems to me that expending a great deal of effort on how various products-with-Windows-in-the-name do or don't handle UTF-8 would be more appropriate in an article about Windows than in this article about UTF-8. -Ac44ck (talk) 03:30, 1 March 2008 (UTC)

I was just trying to make the example more accurate. Win32 (which I probably should have linked) is the name of the Windows API that has duplicated narrow-character and wide-character versions of everything. -- BenRG (talk) 20:51, 1 March 2008 (UTC)
I disagree with Ac44ck's assertion. As far as I can tell, the discussion about the Windows API occupies a small amount of the article's content, and implementation has been an important factor in the development and deployment of UTF-8.—Kbolino (talk) 04:26, 2 March 2008 (UTC)
My contention is that an article about UTF-8 is not the place to have a pissing contest about how one interprets "Windows", "Win32", "WinNT", ad nauseam.
It is one thing to say "such and such was the case in version 4 of UTF-8, but it was changed to behave thus and so in version 5 because ..."
But going into detail about "how Windows works today" is the stuff of a magazine article — not an encyclopedia. It's a good bet that the next "new and improved" version of "Windows" (which despite the common name doesn't mean _one_ thing — it is a moving target) will create incompatibilities with what "Windows" means to all users (of any version) anyway. Going into detail about "how Windows worked yesterday" is the stuff of a magazine back-issue.
The whole notion of keeping "current" with "change for the sake of change" (or perhaps someone else's profit as opposed to the user's benefit) has become tiresome.
An article about UTF-8 is on my watch list. I don't have an article about Windows on my watch list. Having my attention drawn to a pissing contest about what label applies to a moving target became irritating to me.
I don't know whether I can avoid the "let's do it this way today" nonsense by moving to Linux, but I hope to. Then all the controversy about what a particular version of somehow-called "Windows" does with various flavors of UTF would be of even less interest to me in an article on UTF-8. —Ac44ck (talk) 07:34, 2 March 2008 (UTC)

how many points, and how?

My apologies in advance. I just can't follow the explanation given in "Description". My maths don't work: 128 points in the range 0xxxxxxx make sense; but then, it seems to me that the range 110xxxxx 10yyyyyy allows 2048 points, and 1110xxxx 10yyyyyy 10yyyyyy leaves room for 65536. Even taking into account the footnote regarding the prohibition of a certain range, my maths still don't work. Please, someone clarify. Louie (talk) 16:59, 13 June 2008 (UTC)

You subtract the 128 from the 2048 to get 1920, and you subtract the 2048 from the 65536 to get 63488, and then you subtract another 2048 for surrogates to get 61440. Not subtracting would leave the possibility of "overlong forms" (the bane of UTF-8)... AnonMoos (talk) 17:35, 13 June 2008 (UTC)
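The arithmetic above can be written out explicitly (a quick check, nothing more):

```python
# Code points representable by each sequence length, minus the ranges
# already covered by shorter (overlong) encodings, minus the surrogates.
one_byte   = 2**7                      # 128
two_byte   = 2**11 - one_byte          # 2048 - 128 = 1920
three_byte = 2**16 - 2**11 - 2048      # 65536 - 2048 - 2048 surrogates = 61440
assert (one_byte, two_byte, three_byte) == (128, 1920, 61440)
```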
Right: but then, which ones are "removed" to make the adjustments. The scheme allows for full 2048 and 63488 (removing surrogates) slots in the range of 2 and 3 bytes; so I just guess: the first 128 and 2048 in those positions are arbitrarily removed.
Moreover: I can't make heads or tails of the 4-byte scheme: 11110www 10xxxxxx 10yyyyyy 10zzzzzz leaves room for 8388608 slots; if I understand right, only 1/8 is actually used, but that leaves the round number 1048576, and not the intended 1112064 defined for UCS.
Seems to me that there is some hint at the answer to my questions in the last paragraphs and the table at the end of the section, but the wording is too terse, and needs some improvement for legibility. Louie (talk) 17:32, 16 June 2008 (UTC)
The scheme is designed to make decoding simple, not to squeeze every last bit out. You could have encoded 1 as 110 00000 10 000001 (2 bytes), but you disallow that because it can also be encoded as 00000001 - which is shorter. If it can be shorter, it must be shorter. --Alvestrand (talk) 18:17, 16 June 2008 (UTC)
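The overlong encoding of 1 mentioned above (110 00000 10 000001, i.e. bytes C0 81) is exactly what a conforming decoder must reject; for instance, in Python:

```python
# 0x01 encoded "overlong" as two bytes; only the one-byte form is legal.
try:
    b'\xc0\x81'.decode('utf-8')
    raise AssertionError('overlong form was accepted')
except UnicodeDecodeError:
    pass  # expected: overlong forms are invalid UTF-8

assert b'\x01'.decode('utf-8') == '\x01'   # the minimal encoding decodes fine
```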
Got it! I've made some changes to the article to reflect my (current) understanding of the subject. I hope it is useful to others too. Louie (talk) 21:08, 16 June 2008 (UTC)

What's a Western language?

Someone changed "most" to "all" in the sentence Text in all Western languages will be smaller in UTF-8 than it would be in UTF-16 because of the presence of spaces, newlines, numbers, unaccented Latin characters, and punctuation.

For all the Latin-based scripts, that's probably true. But do we consider Serbian, Greek or Bulgarian "western" languages? And (if I'm feeling grumpy) what about Cherokee and Deseret, which have Plane 1 (4-byte) encodings? If they are to be considered "western", we'd better figure out whether the statement is true for them. If not, we'd better restrict "western" somehow. --Alvestrand (talk) 22:01, 31 October 2008 (UTC)

I assume that "languages written in the ISO-8859-1 character set" was meant (in Microsoft Windows, CP1252 is known as the "Western" encoding...). AnonMoos (talk) 22:18, 31 October 2008 (UTC)
After writing the above comment, it dawned on me that the previous sentence probably was intended to cover Cyrillic and Greek as well; they're encoded in 2 bytes per character in UTF-8, but the spaces will be one byte, leading to a more compact representation than UTF-16. But Cherokee and Deseret, being Plane 1 languages, lose. I'll rewrite. --Alvestrand (talk) 06:44, 1 November 2008 (UTC)
Assuming they used ASCII spaces/punctuation, plane 1 languages would do better in UTF-8 than UTF-16; it's the stuff in the higher range of the BMP (mostly CJK, but also a few rarer alphabetic scripts) that loses in UTF-8. Plugwash (talk) 23:18, 1 November 2008 (UTC)
You're right. I keep forgetting that UTF-16 uses 4 bytes for plane 1 characters. --Alvestrand (talk) 20:13, 2 November 2008 (UTC)
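The size claims in this thread are easy to verify; the sample strings below are my own illustrations, not from the article:

```python
# Cyrillic letters cost 2 bytes in both UTF-8 and UTF-16, but the ASCII
# space costs 1 byte vs 2, so the UTF-8 form of mixed text is smaller.
s = 'привет мир'                  # 9 Cyrillic letters + 1 ASCII space
assert len(s.encode('utf-8')) == 19
assert len(s.encode('utf-16-le')) == 20

# A plane 1 character (Deseret U+10400) costs 4 bytes in BOTH encodings:
# 4 UTF-8 bytes, or a UTF-16 surrogate pair of two 16-bit words.
d = '\U00010400'
assert len(d.encode('utf-8')) == len(d.encode('utf-16-le')) == 4
```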

UTF-8b?

Someone added a couple of comments referring to UTF-8b to this page. Since the comments were not referenced, and UTF-8b doesn't seem to exist yet, I'm reverting - there are lots of tricks people have wanted to do with UTF-8, and unless it's documented AND reasonably widely accepted, I don't think it should be on this page. --Alvestrand (talk) 12:27, 11 November 2008 (UTC)

I saw this in a discussion of Python 3.0. It is less to do with UTF-8 and more of a way to solve problems with UTF-16. As mentioned in the advantages page, a huge problem with UTF-16 is that it cannot store an invalid UTF-8 (or any other byte-based) encoding. This causes a lot of problems in handling URLs and page contents that have no guarantee of being valid. Myself I know that UTF-16 is useless because of this, but for me I am trying to handle metadata attached to image files where the original data must be saved. UTF-8b is really an alteration of UTF-16 where 128 low-surrogate codes are instead assigned a meaning of "this invalid byte appeared here". A big problem with it is that you cannot losslessly translate to UTF-8 and back again. Spitzak (talk) 04:26, 1 December 2008 (UTC)

Examples in rationale that I don't understand

I'm either misreading the examples, or missing something fundamental. In the discussion of changes necessary for a program to manage UTF-8 vs ASCII, these examples are given:

  1. File I/O special cases (at most) the ASCII newline character.
  2. Many programming and scripting languages accept all the bytes unchanged up to the closing ASCII quote character and thus support UTF-8 string constants without changes.
  3. Filenames and URLs special-case only ASCII punctuation such as slash and colon.
  4. Case-independence of ASCII only special-cases the ASCII letters.

Why does a program need any special-case handling when reading or writing these characters (newline, quote, slash, colon)? These are all represented in the ASCII set, right? The binary value of a stream "\hEllO:/" would be identical in UTF-8 or ASCII, wouldn't it? (disregarding the optional BOM).
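The byte-identity being asked about here is easy to confirm (using the letters from the example stream):

```python
# A pure-ASCII string has byte-for-byte identical ASCII and UTF-8 encodings.
s = 'hEllO:/'
assert s.encode('utf-8') == s.encode('ascii') == b'hEllO:/'
```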

Likewise, CR and LF are in the ASCII set, why would a UTF-8 encoding require any special handling vis-a-vis ASCII?

What am I missing? Leotohill (talk) 01:53, 19 November 2008 (UTC)

Don't think those are a list of necessary changes, but examples of how many programs are only sensitive to a small number of ASCII delimiter characters, and handle the material between delimiters as unanalyzable atoms. See my recent comments at User talk:JustOnlyJohn‎... AnonMoos (talk) 05:11, 19 November 2008 (UTC)
Ok, I understand now. Watch for my edits to come. Thanks. Leotohill (talk) 15:45, 19 November 2008 (UTC)

Backward compatibility

I've reverted some edits - here's the explanation.

Spitzak wrote: "The program will work if it treats any sequence of characters with the high bit set as a "word" that it does not split or alter, and if any code that draws a representation of the string understands UTF-8. A surprisingly large number of APIs obey these rules, virtually every one on both Unix and Windows."

Programs don't just use APIs; they also scan and manipulate text themselves.

But just like the APIs, most programs do not treat non-ASCII characters specially.

"Depending on your definition of "character" this program will "fail" by not copying 10 of them." I've added text to clarify that "character" means... character. (not byte)

You are either using "Unicode code point" or perhaps "UTF-16 word" as your "character", apparently, but some people consider combining diacriticals or other normalized-form conversions as part of how a "character" is defined. What you are saying is "it is difficult to copy the correct portion of text that will take a fixed amount of memory when converted to some other encoding or normalization form". Well, sorry, that is true of every single encoding in the world. Spitzak (talk) 04:48, 1 December 2008 (UTC)

"It is incredibly likely the calling program expects the "failing" behavior of copying 10 bytes. Copying more bytes by (for instance) copying the first 10 Unicode code points will likely result in a buffer overflow."

If the calling program also is ASCII-oriented, then yes, the calling program would expect 10 chars = 10 bytes.

YES!!! This is the first insightful thing you have said yet. Thank you.Spitzak (talk) 04:48, 1 December 2008 (UTC)

Leotohill (talk) 01:36, 26 November 2008 (UTC)

Long exposure to ASCII, where "byte offset" and "character count" are always equal, has led the terms to be freely interchanged. This means that many algorithms have been defined to take or return "characters". For some reason it is difficult for many people to substitute "byte count", and instead they assume these algorithms must be rewritten to count/return some other definition of "character". This obviously makes them very slow, and the fact that invalid encodings can happen means they are impossible to write.

The solution is to NOT rewrite them, but instead to substitute "byte offset" for "characters" in the documentation. For some reason this is very difficult for a lot of people. They worry that somehow nothing will work without having direct, continuous knowledge of where the character boundaries are. You need to think a little. One way is to think about "words". You will notice that all kinds of useful word processing functions do not have to worry about words at all, and the words somehow (amazing!) preserve themselves due to the simple fact that all useful functions don't shuffle the bytes randomly! The fact that a stream can be broken in the middle of a word, or that you can make a pointer to a middle of a word, does not mean that it is impossible, or even difficult, to write a program to word-wrap correctly!

There have already been several attempts to address these misconceptions in the "string length" portion right below, but obviously it is not working. Perhaps an all-new section could be written that merges all these.

Spitzak (talk) 04:48, 1 December 2008 (UTC)

spitzak, I think you are misunderstanding the point of the example. The program is defined as one that copies 10 characters. From the program user's point of view, that would mean 10 glyphs. (I suppose that in some writing systems there is not a 1:1 character:glyph relationship, but that needn't be introduced into this example.) My statement of the program's purpose, that it is to copy 10 characters, means that it is to copy enough bytes so that the first 10 glyphs are properly preserved. I really didn't want to have to explain all this in the example, because it would detract from its effectiveness.
You say " But just like the API's, most programs do not treat non-ASCII characters specially."

The point is, any program that assumes that 10 bytes = 10 characters can be defective when handling UTF-8 that contains non-ASCII characters. Consider a method GetShortName() that receives a null-terminated string and returns one. If the program simply returns the first 10 bytes (+ null), it's done wrong, by returning fewer characters than expected (character as defined above) or even returning only part of the multiple bytes necessary to represent a character. Your argument that it is "incredibly likely" that the calling program expects 10 bytes doesn't wash: the calling program is just as likely to be ignorant of the length of the expected result. The GetShortName() method would allocate a (new) buffer to represent its result value.
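The two failure modes described here (returning fewer characters than expected, or cutting a multi-byte sequence in half) can be shown with a small sketch; the string and the 10-byte cutoff are just illustrative, echoing the hypothetical GetShortName() above:

```python
# Blindly taking the first N bytes of a UTF-8 buffer is not the same as
# taking the first N characters, and can even produce invalid UTF-8.
s = 'héllo wörld'                 # 11 characters, 13 bytes in UTF-8
data = s.encode('utf-8')
assert len(data) == 13

# 10 bytes happens to end on a character boundary here, but it yields
# only 8 characters, not the 10 the caller asked for:
assert data[:10].decode('utf-8') == 'héllo wö'

# One byte earlier, the cut lands inside the 2-byte 'ö' and the result
# is not valid UTF-8 at all:
try:
    data[:9].decode('utf-8')
    raise AssertionError('expected a decode error')
except UnicodeDecodeError:
    pass
```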

The problem is that in the real world, there is NO "GetShortName()" like you describe. Academic exercises are not relevant when discussing whether something is useful or not. Any possible program called GetShortName() in actual use would expect the text to fit into a fixed-sized buffer, and thus expects 10 bytes. Or it expects 10 UTF-16 words, which are not 10 "characters" any more than 10 UTF-8 bytes are.
Believe me, I understand the difference between a character and a byte, and that's exactly what I'm trying to illustrate with this example. A program that is supposed to manipulate characters isn't going to be fixed by changing its documentation to say that it manipulates bytes. All you've done is make the program and the documentation agree, but the program is still wrong. It's only going to be fixed by changing it to handle multi-byte characters. Leotohill (talk) 16:40, 1 December 2008 (UTC)
The problem is that NO PROGRAM MANIPULATES "characters". Thus the documentation is WRONG, and the program is RIGHT. The fix literally is to change the documentation to read "byte offset" in place of "characters". PLEASE, look at REAL software!!!!! I rewrote an entire ISO8859-1 text editor into UTF-8 and the ONLY changes needed were to the character-advance and character-backup functions, and to the display. The amazing fact is that the result is completely bulletproof (ignoring bugs that were in the ISO8859-1 version). It is impossible to put the cursor between bytes in the UTF-8 or to cut or paste or insert an invalid UTF-8 sequence. Believe me, I thought the same as you did before I did this, but it was very convincing. I really recommend you do something similar and you will learn.Spitzak (talk) 03:35, 4 December 2008 (UTC)

Let me try again. This is not really hard...

All your examples act as though some value "10" will magically appear out of nothing and that is the amount to move. But if you actually analyse real software, you will find that the "10" is ALWAYS produced by scanning the exact same string. For instance strlen() scans for the null. The return value can be in any kind of unit that represents that distance in the same string, but I recommend "number of bytes" because it is very fast and easy to implement, and the value is actually useful (it helps you allocate the right amount of memory to store the string).

No UTF-16 software seems to worry about "characters" and everybody seems happy to use "number of 16-bit words" as the offset. Same thing for all the usage of older multibyte Japanese encodings. But for some reason everybody goes crazy when the exact same rules are applied to UTF-8. This is wrong!Spitzak (talk) 03:46, 4 December 2008 (UTC)
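The point about offsets always being produced by scanning the same string can be illustrated like this (the sample string is my own; the key property is that ASCII delimiter bytes never occur inside UTF-8 multi-byte sequences):

```python
# An offset found by scanning a UTF-8 byte string is a safe byte offset
# for slicing that same string: the delimiter search cannot land inside
# a multi-byte character.
data = 'naïve: wörds'.encode('utf-8')
i = data.find(b':')                    # scan the string, get a byte offset
assert data[:i].decode('utf-8') == 'naïve'   # the slice is always valid UTF-8
```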

sorry, but I'm still not understanding you. Or you aren't understanding me.
When you say that my examples act as if the number 10 "magically appears out of nothing": No. The number 10 comes from the program specification. The program is supposed to grab 10 characters. No scanning is involved. It's just a grossly trivial program that, at its core, takes the bytes from address range <start of buffer> to <start of buffer + 10>. It isn't using any API or string functions. This works fine if byte=character. It fails when byte != character.
So tell me again, using simple words: what's wrong with this example? Isn't the program going to produce incorrect output, possibly even invalid utf-8, when processing multi-byte utf-8 data? Leotohill (talk) 05:01, 4 December 2008 (UTC)
All your arguments are basically the same as saying "in encoding A it is difficult to measure N units of encoding B, but in encoding B it is easy to measure N units of encoding B". I can claim that UTF-32 is "bad" because it is hard to measure N UTF-8 bytes in it, and that argument is just as valid as yours. You have some belief that for some reason some unit you call a "character" is much more common to measure things in than any other, but I have tried to point out over and over that you are basing everything on this false assumption. The only reason "number of characters" appears ANYWHERE is because in ASCII it was synonymous with "byte offset" and was used in its place everywhere. Please, look much more carefully at how software is using "number of characters" and you will find this is ALWAYS true!!!!
Another thought experiment: try to figure out why software like word processors have no difficulty despite the fact that "number of words" and "number of sentences" and so on are "hard" to calculate. How does "move by word" work when all the words are different sizes? How come a mark put between words does not jump into the middle of the words? According to you this is incredibly difficult, but in fact it apparently has been a solved problem for decades.
What is so special about a "character"??? Really, please think before answering again. Come up with a REAL example where "10 characters" is produced other than "I want to measure 10 things that I have purposely chosen to be difficult in UTF-8"Spitzak (talk) 20:12, 4 December 2008 (UTC)
You are really not understanding me. Stop thinking of the example as a general text processing program. All I'm saying is that a program that takes the first 10 bytes of a utf-8 buffer could give incorrect results. I'm not saying that it's hard to write a program to do the right thing, I'm just saying that its easy to write a program that does the wrong thing.
I'm going to try again and get even more basic. I'll ask you a question and a followup question.
1) Is it possible to write a program that produces correct results when processing single-byte encodings but produces incorrect results when processing multi-byte utf-8 encodings?
For ANY encoding I can make it produce the "wrong" answer where the "right" answer is "find the point that is 10 units later in some different encoding. That is what all your examples are equivalent for, but this is hardly unique to UTF-8.
Of course it's not unique to UTF-8. But if we want to claim that UTF-8 offers some measure of compatibility with programs written for ASCII (which I think is worth saying), then it makes sense to explain conditions under which this can fail. Leotohill (talk) 04:35, 10 December 2008 (UTC)
A *REAL* example of a program in actual use that would fail without rewriting in UTF-8 is a program designed to display on the screen that physically separates all the string into individual bytes, calls the system with each byte independently to get a "width", and then adds these widths together to get the total area the string will take (technically this will fail for *any* splitting of the string, no matter what encoding is used, due to Unicode combining characters and directional marks, but you seem to be ignoring that, and for many people it is true that they will consider a program that works for left-to-right appended glyphs ok).
Another example of a *REAL* program that fails is a text editor that moves the cursor by one byte in response to a left/right arrow, these do have to be fixed to move by an entire UTF-8 character. This was in fact the only change I had to make to fix a text editor to work with UTF-8.Spitzak (talk) 02:13, 10 December 2008 (UTC)
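The "move by an entire UTF-8 character" fix described here amounts to skipping continuation bytes; a minimal sketch (the function name is illustrative, not from the editor mentioned above):

```python
# Advance a cursor by one UTF-8 character: step one byte, then skip any
# continuation bytes, which always match the pattern 10xxxxxx (0x80-0xBF).
def next_char_offset(buf: bytes, i: int) -> int:
    i += 1
    while i < len(buf) and 0x80 <= buf[i] <= 0xBF:
        i += 1
    return i

data = 'a€b'.encode('utf-8')   # 'a' at offset 0, '€' occupies 3 bytes at 1
assert next_char_offset(data, 0) == 1
assert next_char_offset(data, 1) == 4
```

Moving backwards is symmetrical: step back one byte, then keep stepping while the byte is in the 0x80-0xBF range.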
Ok, there you go. A program that didn't process UTF-8 properly until it was changed. That's all I'm saying: some programs won't work properly. Leotohill (talk) 04:35, 10 December 2008 (UTC)
2) if your answer is that it is impossible, then we have a very basic disagreement. If your answer admits the possibility, please describe the simplest possible program that does so. Then we'll have two examples: yours and mine, and we can choose which one is best.
Leotohill (talk) 22:53, 4 December 2008 (UTC)

The problem with all your examples is that you are never stating where the "10" comes from. The problem with these examples is that in the real world, that "10" is ALWAYS produced by measuring the very same string, and thus by fixing the measuring part of the program you can use a different unit, such as byte offset.

You continuously reply with "well no, I am defining the breaking program as having to move 10 glyphs". But I consider that example invalid. You can define a program that will be hard to write for *any* encoding, the easiest way is to take something that is trivial in a different encoding and ask for it. I can say that UTF-16 is terrible because I can define a function, say "move 10 UTF-8 bytes", that is "hard" to do in UTF-16. That is an equally invalid argument.Spitzak (talk) 02:18, 10 December 2008 (UTC)

This reply shows that are are misunderstanding the point of the example. You seem to be thinking that I am trying to illustrate that one encoding is better than the other. No: the point is to explain how a program written for one encoding can fail when processing a different encoding. It's not at all about the goodness or badness of an encoding. This is no different than explaining how a program written to process UTF-8 (only) fails when given UTF-16, or vice-versa. It's not a judgment on the benefits of one encoding vs. the other, and it's also not about good or bad programming technique. We simply want to let the reader know that some programs, created to process one kind of encoding (ASCII), can fail when processing a different type (UTF-8). Do we agree that this a valid point?
For an example of a program that can process ASCII but not UTF-8 (or, for that matter, any Unicode), see Vedit. Also, any program written for MS-DOS is also likely to fail, because DOS did not support Unicode, and neither do the DOS APIs that were ported to the Windows command shell CMD.EXE. Leotohill (talk) 04:22, 10 December 2008 (UTC)
Those programs were based on 8-bit OEM code pages (CP 437 etc.). Programs which were originally intended to handle 7-bit ASCII inputs, and which were not heavily involved with detailed text manipulations or multilingual font displays, have a reasonably high probability of being able to handle UTF-8 inputs either unaltered, or with relatively minor and trivial coding changes -- that's all that was meant by "backward compatibility" on the coding side (see User_talk:JustOnlyJohn). AnonMoos (talk) 04:36, 10 December 2008 (UTC)
Programs that handled 8-bit (extended ASCII/ANSI/OEM) also handled 7-bit, of course, but they may fail with UTF-8. So these remain programs that handled ASCII input fine, but not UTF-8.
I agree with both you and JustOnlyJohn in that discussion on his talk page. You are both saying that the backwards compatibility is not 100%. All this example is meant to do is to illustrate that. Some programs fail. If someone wants to add in commentary about the expected ease of modification, that's ok. Of course, it depends upon availability of the source code, a compiler, and a programmer. (s) Leotohill (talk) 06:29, 10 December 2008 (UTC)
I don't think you're really getting my point -- if a program expects CP437 inputs, that's actually different from a program which expects ASCII inputs, and if a program fails because it passes a UTF-8 string unaltered from its input to the MS-DOS API, and MS-DOS doesn't understand UTF-8, then according to the philosophy of those who devised UTF-8, that's really a DOS problem or an interface problem, not a problem with the program itself. As I said on the user talk page, UTF-8 "backwards compatibility" from the programming/coding point of view was a way of making the simple cases of extension from handling ASCII to handling UTF-8 simple to program, but it was not a 100% guarantee of anything (nor intended to be so). AnonMoos (talk) 07:03, 10 December 2008 (UTC)
And I think you aren't getting my point. Leaving aside the dos-app example for the moment, because it's an unnecessary distraction, let's focus on the meaning of "backwards compatible". Are you saying that the intent was to only provide b.c. to programs when the UTF-8 data stream is limited to ASCII characters? If yes, then maybe we should make that explicit in the article. (though I don't think I agree.) If no, then they must have intended that some programs, written for ASCII, could still operate properly when the data stream contains non-ASCII characters (which is true, some can, under some conditions.) If the latter is the proper expression of the meaning of b.c. in this context, then it makes sense to illustrate some conditions under which such a program could fail. Whether the failure is a "program problem" or a OS problem is irrelevant, in two ways: first, the O.S. (or command shell) is of course a program, and CMD.exe is one example of a program that fails with UTF-8 input. Secondly, and more importantly, the point is that with non-ASCII input, you can get incorrect output.
If we want to talk about b.c. in this article, it makes sense to explain what it means. Leotohill (talk) 14:42, 10 December 2008 (UTC)
Read it again: making the simple cases of extension from handling ASCII to handling UTF-8 simple to program -- in other words, backwards compatibility from the programming point of view (as opposed to from the data point of view) means that various characteristics of UTF-8 were chosen with the goal of avoiding as much annoying tedious busywork for programmers as possible, but it's not a promise that programmers won't have to do any work (nor was it really intended as such). AnonMoos (talk) 18:03, 10 December 2008 (UTC)
I don't agree with your interpretation of the meaning of b.c., but we can avoid that issue as well. Let's ask the question: Will the reader be interested in knowing something about the conditions under which a program written to process ASCII can (unchanged) process UTF-8? I think the answer is yes. What do you think? Leotohill (talk) 22:29, 10 December 2008 (UTC)
Yes such programs are interesting, and I tried to give some *REAL* examples above. The problem is that saying "it is difficult to move 10 UTF-16 glyphs" is MISLEADING. That is NOT the type of software that fails, for the simple reason that it DOES NOT EXIST IN THE REAL WORLD. Software that fails is where it splits the text into individual bytes and then sends them in a way that the original order cannot be reconstructed (the best example I can think of is byte-oriented "width of each character" api). Every other objection is due to bad documentation that uses "number of characters" where "byte offset" should be specified, and due to programmers taking that literally as meaning some definition they have for what is a "character".74.62.215.125 (talk) 20:20, 27 February 2009 (UTC)
That last question may have sounded snarky, but I didn't mean it that way. I meant it as a sincere question. Leotohill (talk) 18:57, 12 December 2008 (UTC)

I'm not offended, but I don't know that I have much more to say in the current context of the discussion... AnonMoos (talk) 00:13, 13 December 2008 (UTC)

Colour in example table?

Would it be worth using colour to distinguish the journey of a byte through UTF-8 in the table of examples. Like I have done here in the first row:

Unicode range        Byte 1    Byte 2    Byte 3    Byte 4    Example
U+000000–U+00007F    0xxxxxxx                               '$' U+0024 → 00100100 → 0x24
U+000080–U+0007FF    110xxxxx  10xxxxxx                     '¢' U+00A2 → 11000010 10100010 → 0xC2 0xA2
U+000800–U+00FFFF    1110xxxx  10xxxxxx  10xxxxxx           '€' U+20AC → 11100010 10000010 10101100 → 0xE2 0x82 0xAC
U+010000–U+10FFFF    11110xxx  10xxxxxx  10xxxxxx  10xxxxxx U+10ABCD → 11110100 10001010 10101111 10001101 → 0xF4 0x8A 0xAF 0x8D

What do you think? (anyone know how to name styles in a SPAN tag?) --Surturz (talk) 05:14, 1 December 2008 (UTC)
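For what it's worth, the example column in the table above can be spot-checked against a real UTF-8 codec:

```python
# Verifying the four example rows of the table with Python's UTF-8 codec.
assert '$'.encode('utf-8') == b'\x24'
assert '\u00a2'.encode('utf-8') == b'\xc2\xa2'          # '¢'
assert '\u20ac'.encode('utf-8') == b'\xe2\x82\xac'      # '€'
assert '\U0010abcd'.encode('utf-8') == b'\xf4\x8a\xaf\x8d'
```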

I tried that (alternating red/green) and did not like it, before I made the underscore version.Spitzak (talk) 03:37, 4 December 2008 (UTC)

Actually, your version preserves the underscores as well as using color. That may be better.Spitzak (talk) 03:53, 4 December 2008 (UTC)

Removed paragraph

I removed this paragraph:

  • Conversion of a string of random 16-bit values that is assumed to be UTF-16 to UTF-8 is lossless. But invalid UTF-8 cannot be converted losslessly to UTF-16. This makes UTF-8 a safe way to hold data that might be text; this is surprisingly important. For instance an API to control filesystems with both UTF-8 and UTF-16 filenames, but where either one may contain names with invalid encodings, can be written using UTF-8 but not UTF-16.

You can't convert invalid UTF-16 to valid UTF-8. You also can't convert invalid UTF-8 to valid UTF-16. You can convert invalid UTF-16 to an "extended UTF-8", for example by encoding unpaired surrogates CESU-8-style. You also can convert invalid UTF-8 to an "extended UTF-16", for example by encoding invalid bytes as unpaired surrogates in the range DC80 to DCFF, as Markus Kuhn proposed in 1999. The situations are exactly symmetrical. UTF-8 has no advantage here, even if by "UTF-8" one means "nonstandard extension of UTF-8 invented for this specific purpose". -- BenRG (talk) 14:34, 1 December 2008 (UTC)
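The DC80-DCFF scheme mentioned here (Markus Kuhn's 1999 proposal) is what Python later standardized as the 'surrogateescape' error handler, which makes the round trip easy to demonstrate:

```python
# Each byte that is invalid in UTF-8 is mapped to an unpaired low
# surrogate in DC80-DCFF on decode, and mapped back on encode.
bad = b'ab\xffcd'                     # 0xFF can never appear in valid UTF-8
s = bad.decode('utf-8', 'surrogateescape')
assert s == 'ab\udcffcd'              # the invalid byte became U+DCFF
assert s.encode('utf-8', 'surrogateescape') == bad   # lossless round trip
```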

Though you cannot encode invalid UTF-16 to truly strictly valid UTF-8, it is pretty obvious how to do it (exactly like you propose) and the result passes most UTF-8 validators (in fact truly valid UTF-8 encoding seems to trip up more validators than CESU-8 does!).

You are mentioning something I have seen called UTF-8b. I have investigated this and it does result in the reverse situation. If all your UTF-8 to/from UTF-16 conversions follow the UTF-8b rules, then UTF-16 can losslessly encode any invalid UTF-8 stream. And the inverse is now false: you can make a UTF-16 stream that cannot be losslessly converted to UTF-8 (imagine it contains the half-surrogates arranged so the result is a legal UTF-8 stream). I do believe UTF-8b (which should really be called some kind of new assignment for Unicode code points and not have UTF-8 in its name) might be useful for losslessly storing UTF-8 text into a database that wants UTF-16 or another non-byte encoding, but otherwise its uses are not that great, and I cannot see it being used as a portable API layer, due to the fact that far too much code does not encode/decode it correctly. The "invalid UTF-8" encoding is enormously common; in fact I would estimate there are 10x as many decoders/encoders that do invalid UTF-16 "correctly" than do *valid* UTF-16 by actually translating paired surrogates into a single character. Also since the text of interest is usually UTF-8 already, far less encoding/decoding must be done.

I recommend you actually try to write some software that has to handle byte sequences that *might* be text, and you will very quickly find out, like I have, why this is probably one of the most important advantages of UTF-8 over UTF-16. I am going to put the paragraph back. Please feel free to reword it to make it clearer, however.74.62.215.125 (talk) 03:12, 4 December 2008 (UTC)

Sorry, but this paragraph just can't stay. UTF-8 has a formal specification. Your scheme violates that specification. End of story. If this paragraph went in the article at all it would have to be as a disadvantage of UTF-8, because what you're arguing is that it's better to ignore the UTF-8 spec in favor of something else. Note that what you're advocating isn't CESU-8 either: CESU-8 only permits six-byte sequences that encode paired high and low surrogates, it doesn't permit unpaired surrogates. Your scheme isn't covered by any current standard. Maybe it should be. I won't claim that you're morally wrong, but you are factually wrong, so the paragraph has to go. -- BenRG (talk) 17:04, 4 December 2008 (UTC)
No, you are doing the same thing as others above to try to make UTF-8 more difficult than it really is. Please read the above discussion which goes into lots of length about how it is OK to store invalid UTF-16 encodings in UTF-8. There appear to be people who are finally realizing they made a BIG mistake by changing all their code to "wide characters" and try to defend their choice by selectively choosing words in the UTF-8 spec to claim that you cannot possibly use it to store some kinds of information. It is plenty obvious how to store invalid UTF-16 in UTF-8 and everybody does it exactly as stated. You also need to try to write some software that manipulates data that *might* be UTF-8 and you will quickly learn how vitally important this is and why translating everything to UTF-16 is VERY VERY VERY (!!!!) BAD!!!Spitzak (talk) 20:19, 4 December 2008 (UTC)
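For what it's worth, the scheme Markus Kuhn proposed (mentioned above) is what Python later shipped as the `surrogateescape` error handler: each undecodable byte 0xXX becomes the unpaired surrogate U+DCXX, and encoding reverses it. The asymmetry both sides are arguing about is easy to observe:

```python
# "surrogateescape" maps each invalid input byte 0xXX to the lone
# surrogate U+DCXX, so the byte string round-trips losslessly.
raw = b'abc\xff\xfe'                         # not valid UTF-8
s = raw.decode('utf-8', errors='surrogateescape')
assert s == 'abc\udcff\udcfe'                # invalid bytes -> lone surrogates
assert s.encode('utf-8', errors='surrogateescape') == raw   # lossless

# The other direction: a lone surrogate is rejected by a strict
# UTF-8 encoder, so "extended" output is needed to represent it.
try:
    '\udcff'.encode('utf-8')
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
assert not strict_ok
```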

Noncharacters

I removed the following paragraph because it is incorrect:

  • Unicode also disallows the 2048 code points U+D800..U+DFFF (the UTF-16/UCS-2 surrogate pairs) and also the 32 code points U+FDD0..U+FDEF (noncharacters) and all 34 code points of the form U+xxFFFE and U+xxFFFF (more noncharacters). See Table 3-7 in the Unicode 5.0 standard. UTF-8 reliably transforms these values, but they are not valid scalar values in Unicode, and thus the UTF-8 encodings of them may be considered invalid sequences.

The 66 Unicode noncharacters described above ARE valid code points and do not cause encodings of them to be ill-formed. They can be used by applications for their own internal use. The Unicode standard forbids applications from exchanging noncharacters publicly, but doing so does not affect the well-formedness of such sequences. In particular, conformant Unicode implementations are allowed to handle unexpected noncharacters by deleting or ignoring them instead of throwing an error. 71.98.69.114 (talk) 05:42, 11 January 2009 (UTC)

Whoops, part of the paragraph was correct (the part about surrogates). I added that back. 71.98.69.114 (talk) 05:50, 11 January 2009 (UTC)
You sure about this? The documentation I saw described all of these using the same "non-character" syntax so they all looked equivalently "invalid". It seems to me that the invalidness of the surrogate pairs is really a quirk of the UTF-16 encoding, and that if UTF-16 is not being used they are no worse than any other code points that are "invalid".74.62.215.125 (talk) 05:23, 13 January 2009 (UTC)
Certainly the treatment of BOM assumes that 0xFFFE is invalid! Otherwise it may be possible to inject a wrong-endian UTF-16 BOM into a UTF-16 stream by encoding it in UTF-8. It does seem to me that anybody who thinks an unmatched surrogate half is "bad" should think this is far worse.74.62.215.125 (talk) 05:28, 13 January 2009 (UTC)
The standard forbids the public exchange of noncharacters such as U+FFFE, but they can nonetheless be encoded in well-formed UTF-8 strings according to Table 3-7 in the standard. For example, U+FFFE would be encoded as <EF BF BE>, which matches row 6 in the table. 71.98.69.114 (talk) 04:26, 18 January 2009 (UTC)
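The distinction drawn above (noncharacters are well-formed, surrogates are not) matches what a strict codec actually does, for example in Python:

```python
# The noncharacter U+FFFE is well-formed UTF-8 per Table 3-7:
assert '\uFFFE'.encode('utf-8') == b'\xEF\xBF\xBE'
assert b'\xEF\xBF\xBE'.decode('utf-8') == '\uFFFE'

# A surrogate code point, by contrast, is rejected outright:
try:
    '\uD800'.encode('utf-8')
    surrogate_ok = True
except UnicodeEncodeError:
    surrogate_ok = False
assert not surrogate_ok
```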
The problem is that the half-surrogates can be encoded exactly the same. I do not see the difference. If certain Unicode code points are not allowed in interchange, they should not be allowed. I am pretty certain a bad BOM will screw up far more programs than a half-surrogate-pair.
All these attempts to put a "hole" into UTF-8 seem to be a concerted effort to remove the ability to losslessly store UTF-16 in it. As far as I can tell this is due to some people realizing they made a terrible mistake by switching to "wide characters" and that UTF-8 will wipe out all their work, unless they somehow disable the ability to rewrite UTF-16 APIs to use UTF-8, by eliminating the lossless ability to translate UTF-16 to it. Notice that these people NEVER complain that UTF-16 can hold half-surrogates (because if they were illegal in both encodings, it would still allow translation, and also because that would make handling UTF-16 enormously more difficult, while they only want UTF-8 to be difficult).195.112.50.170 (talk) 12:04, 11 February 2009 (UTC)
I'm not understanding this issue clearly. Can you describe why you would ever need to encode UTF-16 sequences in UTF-8 (pretending for the moment that this is possible), as opposed to decoding and then re-encoding the string in UTF-8? 71.98.69.114 (talk) 17:56, 21 February 2009 (UTC)
You need to do this if you have an invalid UTF-16 sequence and you wish to convey it through an API that wants UTF-8. For instance, imagine a UTF-8 api to a Windows filesystem, with a call to list the files in a directory. Nothing prevents those files from having invalid UTF-16 sequences, so somehow those must be returned. In addition it is very useful if the returned value uniquely identifies the file (for instance you may want to rename or delete the invalid files, so you must be able to name them!). People argue against allowing invalid encodings because they want to make such API's impossible so that their investment in wide characters is somehow justified.Spitzak (talk) 23:56, 26 February 2009 (UTC)
The exact same problem exists with trying to create a UTF-16 api to a filesystem that allows invalid UTF-8 names.
No the whole point is that this problem does //NOT// exist if you use UTF-8 to control a UTF-16 filesystem. Only if you try to artificially constrain UTF-8, by perhaps claiming that the obvious encoding of unpaired surrogates is somehow "illegal", can you cause this to happen. This artificial redefinition is strictly a desperate attempt to remove this obvious and very important superiority of UTF-8 over UTF-16.
There are methods of making UTF-16 losslessly encode UTF-8, and this does reverse things. However these ideas are at a serious disadvantage in that probably 99.9% of the UTF-8<->UTF-16 encoders do not obey it and will break it. The UTF-8 encoding of invalid UTF-16 is most likely obeyed by nearly all encoders in common use, in fact they are far more likely to encode //invalid// UTF-16 than //valid// UTF-16 correctly! (this is what CESU is).Spitzak (talk) 19:54, 7 March 2009 (UTC)
Nothing stops you from defining an API that says invalid UTF-16 will be represented by invalid UTF-8 in a particular way. IMO there is a good reason not to standardise this though, namely that making a standard that both makes sense to normal humans, is easy to process and provides a bijective mapping between all invalid UTF-8 and all invalid UTF-16 is very difficult. Plugwash (talk) 03:15, 28 February 2009 (UTC)
Everybody has standardized on a way to represent invalid UTF-16 in UTF-8. It is blatantly obvious how to do so, and in fact there are far more UTF-8 encoders that encode invalid UTF-16 correctly than ones that encode valid UTF-16! It does make sense to humans, and it is easy to process. It is not possible to make a bijective mapping between UTF-8 and UTF-16, as you can arrange the codes used for errors in UTF-16 so that the resulting UTF-8 is valid and thus will not translate back to the same UTF-16. Trying to fix this by saying that an arrangement of UTF-16 errors that produces a valid UTF-8 is invalid and must be decoded to different UTF-8 helps, but the result is that there are UTF-8 strings that cannot be achieved and you are back where you started. You can repeat this over and over again, but as far as I can tell it is an infinite recursion trying to figure out what sequences are allowed and what ones are not. —Preceding unsigned comment added by 96.229.138.20 (talk) 08:50, 17 April 2009 (UTC)

ambiguous numbers in tables within "description"

Numbers in tables within "description" are ambiguous.

The first table within the "Description" section claims that the start of a 4-byte sequence is 11110zzz (i.e. 11110000–11110111, 0xF0–0xF7), while the second table claims that the start of a 4-byte sequence is within the range 11110000–11110100 (F0–F4). The ranges are obviously different, and this needs to be fixed.

EvilBoxOfDoom (talk) 06:05, 31 January 2009 (UTC)

The only way you will get a byte value in the range F5 to F7 out of the mapping table is to put in a number that is not a valid code point as the second table makes clear "Restricted by RFC 3629: start of 4-byte sequence for codepoint above 10FFFF". Plugwash (talk) 15:37, 31 January 2009 (UTC)
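Plugwash's point can be checked directly: the lead bytes F5–F7 would only ever come from code points above U+10FFFF, which RFC 3629 forbids and which modern implementations refuse to create at all. A quick Python illustration:

```python
# The largest valid code point, U+10FFFF, has lead byte 0xF4;
# nothing a conformant encoder emits can start with 0xF5-0xF7.
assert chr(0x10FFFF).encode('utf-8') == b'\xf4\x8f\xbf\xbf'
assert chr(0x10FFFF).encode('utf-8')[0] == 0xF4

# Code points above U+10FFFF do not even exist in Python:
try:
    chr(0x110000)
    exists = True
except ValueError:
    exists = False
assert not exists
```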

UTF-8Y

the article should cite UTF-8Y as name for UTF-8 with BOM http://www.mail-archive.com/unicode@unicode.org/msg13911.html http://www.jedit.org/users-guide/encodings.html —Preceding unsigned comment added by 62.154.199.179 (talk) 10:05, 3 February 2009 (UTC)

The registry of charsets at [7] does not show UTF-8Y registered. The first of the links is a 2002 email speculating about the possibility of using the name. The second uses the term, but refers to no definition of it. I don't see either of these as satisfyng verification for the name being a defined term. --Alvestrand (talk) 10:38, 3 February 2009 (UTC)
jEdit is one of those applications that use UTF-8Y for "UTF8 with BOM". -- leuce (talk) 07:49, 7 April 2009 (UTC)
See link above - Jedit documentation refers to it, but doesn't define it. Wikipedia is all about references.... --Alvestrand (talk) 08:20, 7 April 2009 (UTC)
Well, I don't feel strongly about it. IMO the jEdit reference is pretty unambiguous. And there are many references on the web to UTF8y, most of which in context of issues with BOM, so its *existence* can't be disputed. -- leuce (talk) 08:46, 7 April 2009 (UTC)
Added: jEdit's user manual does not define UTF-8Y but the readme referring to its support does: [8] "Added support for UTF-8Y encoding, which is like UTF-8 except there is a three-byte signature (0xEFBBBF) at the beginning of the file." Interestingly, the article referenced by the jEdit developer [9] does not refer to UTF-8Y specifically, but it refers to non-BOM'ed UTF-8 as UTF-8N. -- leuce (talk) 09:01, 7 April 2009 (UTC)

BOM section not neutral

The section about the BOM is not neutral. It is written in a decidedly anti-BOM fashion and casts facts in that light only. Two arguments are given in the article about why a BOM is bad. These arguments are not referenced, and even if they were, they're argumentative. They are:

  1. This causes interoperability problems with software that does not expect the BOM. -- well, not having a BOM also causes interoperability problem, with software that expects or requires the BOM for identifying the file encoding.
  2. It removes the desirable feature that UTF-8 is identical to ASCII for ASCII-only text. -- no indication is given for why this would be desirable; don't forget that ASCII is not Latin-1 is not ANSI.
  3. For instance a text editor that does not recognize UTF-8 will display "ï»¿" at the start of the document, even if the UTF-8 contains only ASCII and would otherwise display correctly. -- true, but if the text editor is not programmed to look for non-ANSI or non-ASCII characters, but it is programmed to look for a BOM, it will also display the file incorrectly, and not just the first three characters...
  4. ...and use an external UTF-8 text editor when editing text strings. However such a compiler would not accept the BOM, which must be removed manually. -- the reverse is also true, namely that if programs and scripting languages (such as AutoIt) expect the BOM, and the BOM was removed by a purist text editor, it has to be added manually.
  5. Programs that identify file types by special leading characters will fail to identify the UTF-8 files... -- sadly, there are references for this silly argument, so "it must be true", but programs that can't read BOM'ed UTF8 files typically have the same problem with UTF16 files, for the simple reason that those programs require a very specific type of file that users should not fiddle with unless they know what they are doing; besides, it is not an insurmountable problem to add recognition of UTF8 BOMs to such programs, if the willingness exist (it usually doesn't).
  6. Some Windows software (including Notepad) will sometimes misidentify UTF-8 (and thus plain ASCII) documents as UTF-16LE if this BOM is missing... -- this bug is of the same type as the Unix programs' inability to read shebangs from a BOM'ed file.

-- leuce (talk) 08:24, 7 April 2009 (UTC)

What you call "anti-BOM", others might consider to be a recognition of the basic fact that while the BOM is highly useful in some cases, it's by no means an all-purpose solution, and in some contexts tends to create more problems than it resolves... Also, why would a text-editor that knows about UTF-8 require the presence of a BOM in its input, when the BOM is in general highly optional for UTF-8? AnonMoos (talk) 09:41, 7 April 2009 (UTC)
You say several times that a program that "expects a BOM" will break. But any such program will break if given an ASCII file, because there is no BOM there! This is *exactly* why the BOM is a bad idea, and why programs that expect it are broken. The very fact that you quote the "bush hid the facts" bug, for a string that contains NO 8-bit characters, should show why depending on the BOM is wrong. —Preceding unsigned comment added by 96.229.138.20 (talk) 08:14, 17 April 2009 (UTC)
Lots of programs *can* read UTF-8 files, *with no changes to the program*. This is a *huge* difference between UTF-8 and UTF-16, one you seem to not understand. UTF-16 can have a BOM because the program has to completely change its logic to read it anyway! But a BOM in a UTF-8 file is in many cases the one thing that *forces* the program to be changed, so it can skip it! Your argument about UTF-16 having problems is completely irrelevant or actually opposite what you intended. —Preceding unsigned comment added by 96.229.138.20 (talk) 08:23, 17 April 2009 (UTC)
Sorry I had to fix it. Any software that requires a BOM is obviously new and in a much better position to be fixed, likely somebody has the source code and understands it. The entire point of UTF-8 is that existing systems do not need to be rewritten. If you remove this you might as well use another byte-based encoding that does not bother being ASCII-compatible. —Preceding unsigned comment added by 96.229.138.20 (talk) 07:52, 28 April 2009 (UTC)
I'm fairly happy with the new version (after you backed out my previous attempt). I'm not trying to say the BOM is bad. What I am trying to prevent is novice programmers thinking that they should only decode UTF-8 if the BOM is there. This sort of thinking is extremely damaging to internationalization. I also tried to fix it to point out that the problem programs are *existing* ones as new programs should always be written correctly. Programs *should* be fixed to ignore the BOM if possible, I did not want to imply that they should not. Even if Notepad is fixed, BOM's will still appear due to conversion of UTF-16 files with them. —Preceding unsigned comment added by Spitzak (talkcontribs) 18:56, 30 April 2009 (UTC)
The point is that the Wikipedia is not the place to educate novices about the preferred way of doing something. This section of the article is about the current state of affairs. As such, it should be neutral. It should not try to convince a reader that X is wrong or Y is right. Let's keep it neutral (i.e. the version I put up just now). -- leuce (talk) 14:03, 3 May 2009 (UTC)
Spitzak, it's been apparent for a long time that you passionately dislike the BOM and UTF-16 and want the article to come out against them. Because you're willing to spend more time on the article than me or apparently anyone else, you've largely gotten your way. That doesn't make the article's current state appropriate. You need to take a step back and realize that the BOM is an imperfect solution for an imperfect world, as is UTF-8, as is UTF-16, as is Unicode. You might as well ask Israelis and Arabs to write left to right and Turks to stop using that dotless i as ask people to drop UTF-16 and the BOM. All of those would make text processing easier. None of them is going to happen. If you can't bring yourself to feel sympathy for character encoding conventions that differ from your own, please consider voluntarily recusing yourself from editing this article. -- BenRG (talk) 20:45, 30 April 2009 (UTC)
Will you PLEASE look for the word "existing" in all my recent edits and understand what I am talking about!!!!!
I'm sorry my rather flippant listing of "aesthetic" objections to the BOM (which was supposed to sound like I don't really agree) sounded like I was non-NPOV. I have removed these and tried to make it clear that it is OK, trivial, and expected that software should skip the BOM (which it is).
My main problem with all your edits is that you imply that a program that requires the BOM is correct (mostly by mentioning some kind of incompatibility with software that "expects" the BOM). This sort of misinformation is incredibly damaging to I18N.Spitzak (talk) 01:00, 5 May 2009 (UTC)
I also want to say that of course UTF-8 is an imperfect solution. However there seems to be a concerted effort to make sure it is not a solution at all by making sure it is incompatible with existing software. I am not trying to make anybody drop UTF-16 or the BOM, I am trying to stop misinformed people from breaking the entire purpose behind UTF-8. —Preceding unsigned comment added by Spitzak (talkcontribs) 01:06, 5 May 2009 (UTC)
The point is that the Unicode documentation never states or implies that using a BOM is incorrect, or that requiring a BOM or not requiring a BOM is correct or incorrect for UTF-8. So the idea that it is wrong for software to require a BOM is a personal opinion (one that is shared by many people, particularly in the Linux world), but it remains an opinion. It is such a prevalent opinion that it needs mentioning in the article, but it is not the "only truth". The purpose of the section about the UTF-8 BOM is to show how the BOM is a contentious issue, and to show why.
Some software requires a BOM -- it is a fact. There is also FLOSS software that will refuse to open a file if a UTF-8 BOM is detected in it. Whatever we may think of such software (good or bad), the fact is that both exist and that their existence is proof that the UTF-8 BOM is a thorny issue... and that is all that this section is about. The article is not there to help rid the world of the UTF-8 BOM or of programs requiring it. -- leuce (talk) 22:12, 14 May 2009 (UTC)
Requiring a BOM is not technically "incorrect", since it falls into an area of application program behavior which the UTF-8 standards specifications do not directly cover. However, it goes against declarations such as that "Use of a BOM is neither required nor recommended for UTF-8."Byte-order_mark#References -- AnonMoos (talk) 23:32, 14 May 2009 (UTC)
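A decoder can in fact satisfy both camps in this thread: accept the BOM when present without ever requiring it. Python's `utf-8-sig` codec is one existing implementation of exactly that behaviour:

```python
import codecs

# "utf-8-sig" strips a leading BOM if present, and decodes BOM-less
# input identically; plain "utf-8" preserves the BOM as U+FEFF.
with_bom = codecs.BOM_UTF8 + 'hello'.encode('utf-8')

assert with_bom.decode('utf-8-sig') == 'hello'                  # BOM stripped
assert b'hello'.decode('utf-8-sig') == 'hello'                  # BOM not required
assert with_bom.decode('utf-8') == '\ufeffhello'                # strict decode keeps it
```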

Size Comparisons

Here is the script I used to do the size comparisons. Be sure the file is saved with the .html extension or lynx will misbehave. This will also work on other OSes if you install bash, uniconv, gzip, perl, wc, and lynx. Note that a unicode aware version of wc might not report the file sizes correctly; this one was unicode oblivious and so reported the actual byte counts as character counts. Google reportedly trained their language translator with a large corpus of identical documents in many languages provided by the United Nations. If someone can find these documents, that might be a good test.

The wikipedia list of other language translations in the left sidebar throws the plain text numbers off a bit as that is 4320 bytes out of 24079.

#!/bin/bash
getsize()
{
# wc -c always reports bytes regardless of locale, so the perl
# field-extraction is unnecessary
wc -c < "$1" | tr -d ' '
}

lynx --dump --nolist $1 >$1.plain

uniconv -out $1.plain.UTF-16LE -decode utf-8 -encode utf-16-le -in $1.plain
uniconv -out $1.plain.UTF-16BE -decode utf-8 -encode utf-16-be -in $1.plain
uniconv -out $1.UTF-16BE -decode utf-8 -encode utf-16-be -in $1
uniconv -out $1.UTF-16LE -decode utf-8 -encode utf-16-le -in $1

gzip <$1 >$1.gz
gzip <$1.UTF-16BE >$1.UTF-16BE.gz
gzip <$1.UTF-16LE >$1.UTF-16LE.gz
gzip <$1.plain >$1.plain.gz
gzip <$1.plain.UTF-16BE >$1.plain.UTF-16BE.gz
gzip <$1.plain.UTF-16LE >$1.plain.UTF-16LE.gz

size_utf8_xml=`getsize $1`
size_utf16BE_xml=`getsize $1.UTF-16BE`
size_utf16LE_xml=`getsize $1.UTF-16LE`

size_utf8_plain=`getsize $1.plain`
size_utf16BE_plain=`getsize $1.plain.UTF-16BE`
size_utf16LE_plain=`getsize $1.plain.UTF-16LE`

size_utf8_xml_gz=`getsize $1.gz`
size_utf16BE_xml_gz=`getsize $1.UTF-16BE.gz`
size_utf16LE_xml_gz=`getsize $1.UTF-16LE.gz`

size_utf8_plain_gz=`getsize $1.plain.gz`
size_utf16BE_plain_gz=`getsize $1.plain.UTF-16BE.gz`
size_utf16LE_plain_gz=`getsize $1.plain.UTF-16LE.gz`

let size_utf8_total=$size_utf8_xml+$size_utf8_plain+$size_utf8_xml_gz+$size_utf8_plain_gz
let size_utf16BE_total=$size_utf16BE_xml+$size_utf16BE_plain+$size_utf16BE_xml_gz+$size_utf16BE_plain_gz
let size_utf16LE_total=$size_utf16LE_xml+$size_utf16LE_plain+$size_utf16LE_xml_gz+$size_utf16LE_plain_gz

echo "<table border='3'>"
echo "<tr><th></th><th colspan='2'>XHTML</th><th colspan='2'>Plain, no hyperlinks</th><th></th></tr>"
echo "<tr><th>Encoding</th><th>Uncompressed</th><th>Compressed</th><th>Uncompressed</th><th>Compressed</th><th>total</th></tr>"
echo "<tr><td>UTF-8</td><td>$size_utf8_xml</td><td>$size_utf8_xml_gz</td><td>$size_utf8_plain</td><td>$size_utf8_plain_gz</td><td>$size_utf8_total</td></tr>"
echo "<tr><td>UTF-16BE</td><td>$size_utf16BE_xml</td><td>$size_utf16BE_xml_gz</td><td>$size_utf16BE_plain</td><td>$size_utf16BE_plain_gz</td><td>$size_utf16BE_total</td></tr>"
echo "<tr><td>UTF-16LE</td><td>$size_utf16LE_xml</td><td>$size_utf16LE_xml_gz</td><td>$size_utf16LE_plain</td><td>$size_utf16LE_plain_gz</td><td>$size_utf16LE_total</td></tr>"
echo "</table>"

Whitis (talk) 17:47, 14 May 2009 (UTC)

To remove the bias of non-native sites, I went looking for popular Chinese and Indian sites. www.people.com.cn gave similar results. But not only were all the Indian websites in English, they didn't even have a button to switch to one of the native languages. I went looking for another non-ideographic language in the range that UTF-8 encodes as 3 bytes that might be impacted. Thai has just 127 characters in the poorly encoded area: http://th.wikipedia.org/wiki/Thai_alphabet was still significantly smaller as UTF-8. The characters are encoded as "e0 b8 xx". Wikipedia's category links throw things off a bit. I tried some Thai websites. After trying several that were listed as Thai-language-only but turned out to be in English, without even a link to switch to Thai, I found www.trekkingthai.com. It has a few English words down in the lower left corner but is predominantly Thai. Even on the plain text, UTF-8 was dramatically better (3:1) – more so than the HTML. Lynx inserted filenames; guess what, the Thai webmaster uses English filenames. ASCII has invaded the world and UTF-8's advantage encoding ASCII has a significant impact. Whitis (talk) 19:31, 14 May 2009 (UTC)


I changed back your last edit, since it would have tripled the size of the article (such raw data really belongs on Wikisource, not Wikipedia). Also, UTF16LE and UTF16BE only differ by a few bytes, so it's not too useful to include both (it would be better to include relevant single-byte encodings). And it would be nice if you used Wikitables instead of HTML tables... AnonMoos (talk) 23:43, 14 May 2009 (UTC)
I don't think this section is appropriate. It doesn't respect Wikipedia:Avoid self-reference. There is no reason to give such massive emphasis to our own site. Superm401 - Talk 05:56, 15 May 2009 (UTC)
I once wrote in "Compared to UTF-16": ASCII includes spaces, numbers, newlines, some punctuation, and XML markup, so it is not unusual for ASCII characters to dominate. For example both the Japanese and the Korean UTF-8 article on Wikipedia take more space if saved as UTF-16 than the original UTF-8 version. That should be enough, maybe a few more examples with sizes, including Hindi, but big tables are probably not needed. --BIL (talk) 07:07, 15 May 2009 (UTC)

Correctness

I saved hi:India as plain text (Save as... and choose text file) as UTF-8 and as UTF-16LE (the Windows standard) and the result was UTF-8 – 67 kB and UTF-16LE – 58 kB. So I doubt the correctness of the http://hi.wikipedia.org/wiki/India (Hindi) table. I also doubt the correctness of the values for ja:UTF-8 in Japanese. I simply do not believe that the Japanese plain text takes 11000 bytes in UTF-8 and 24264 bytes in UTF-16. Either double-check the calculation with another method than the perl script or erase the table. If something is proved with a computer program, the program itself must be shown to be correct. --BIL (talk) 07:04, 15 May 2009 (UTC)

It isn't even mathematically possible for UTF-16 to take more than twice the space of UTF-8, and the tables show that happening frequently. Clearly something was broken. -- BenRG (talk) 11:37, 19 May 2009 (UTC)
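BenRG's bound follows from per-code-point sizes: UTF-16 needs 2 bytes where UTF-8 needs 1–3, and 4 bytes exactly where UTF-8 also needs 4. A quick check across the boundaries of each range (surrogates excluded, since they are not encodable):

```python
# For every code point, UTF-16 takes at most twice as many bytes
# as UTF-8, so whole-file ratios beyond 2x indicate a broken tool.
for cp in [0x41, 0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF]:
    c = chr(cp)
    u8 = len(c.encode('utf-8'))
    u16 = len(c.encode('utf-16-le'))   # LE chosen to avoid counting a BOM
    assert u16 <= 2 * u8
```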

Ideograms vs. logograms

I replaced the three occurrences of 'ideogram' with their correct name, 'logogram', since clearly (and in one case definitely) CJK characters were meant, which are logographic, not ideographic. JAL - 13:48, 2006-06-06 (CET)

Normal date format: 13:48, 06 June 2006

External links

The second UTF-8 test page is more comprehensive than the first, which is now unmaintained. Should we just remove the first?

Guessing at date: 12:00, 12 December 2007