Talk:CESU-8


rewrite

I have started a rewrite at CESU-8/temp. As per the instructions in the copyvio template, I am reporting it here. Plugwash 13:51, 14 July 2005 (UTC)

  • Temp page has replaced the main article. RedWolf 03:34, July 22, 2005 (UTC)

examples

Can you please give an example of a string encoded in CESU-8, and say in which cases it is treated in a special way? -- Nichtich 00:23, 27 October 2005 (UTC)
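As a possible answer to the question above, here is a minimal Python sketch of a CESU-8 encoder. The function name `cesu8_encode` and the sample string are illustrative assumptions, not taken from any library. Only the supplementary character U+10400 is treated specially: it becomes two 3-byte sequences (one per UTF-16 surrogate) instead of one 4-byte UTF-8 sequence, while BMP characters encode exactly as in UTF-8.

```python
def cesu8_encode(text):
    """Encode text as CESU-8 (hypothetical helper, for illustration only)."""
    # Step 1: view the text as UTF-16 code units.
    utf16 = text.encode("utf-16-be", "surrogatepass")
    out = bytearray()
    for i in range(0, len(utf16), 2):
        unit = int.from_bytes(utf16[i:i + 2], "big")
        # Step 2: encode each 16-bit unit on its own, as if it were a BMP
        # code point. "surrogatepass" allows the 3-byte form for surrogates.
        out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)


sample = "A\u00e9\U00010400"  # 'A', 'é', and U+10400 (outside the BMP)
print(cesu8_encode(sample).hex())    # 41c3a9eda081edb080
print(sample.encode("utf-8").hex())  # 41c3a9f0909080
```

The first five bytes are identical in both encodings; only the last character differs (`eda081edb080` versus `f0909080`).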

Advantages

What are the advantages of CESU-8 over UTF-8? --Apoc2400 09:15, 6 December 2006 (UTC)

I can think of three
1: if used for serialisation of strings in languages where strings are natively UTF-16 it won't break if someone decides to use string types for something other than valid UTF-16 data.
2: if sorted using a byte orientated sort the result will be the same as using a word orientated sort on UTF-16
3: conversion between CESU-8 and UTF-16 is simpler than conversion between UTF-8 and UTF-16
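Point 2 can be checked with a small Python sketch. The helper names `utf16_units` and `cesu8_encode` and the three sample strings are assumptions chosen for illustration. The sketch sorts the same strings by UTF-16 code units, by CESU-8 bytes, and by UTF-8 bytes.

```python
def utf16_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-be", "surrogatepass")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def cesu8_encode(text):
    """Encode each UTF-16 code unit separately (hypothetical CESU-8 helper)."""
    return b"".join(chr(u).encode("utf-8", "surrogatepass") for u in utf16_units(text))


# One ASCII letter, one high BMP character, one supplementary character.
words = ["\uff5e", "\U00010400", "z"]

by_utf16 = sorted(words, key=utf16_units)                 # word-oriented sort
by_cesu8 = sorted(words, key=cesu8_encode)                # byte-oriented sort
by_utf8 = sorted(words, key=lambda s: s.encode("utf-8"))  # byte-oriented sort

print(by_utf16 == by_cesu8)  # True
print(by_utf16 == by_utf8)   # False
```

In UTF-16 and CESU-8 the supplementary character sorts before U+FF5E, because its leading surrogate 0xD801 is smaller than 0xFF5E. In UTF-8 it sorts after, because its lead byte 0xF0 is larger than 0xEF.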


But as the name suggests, the main reason it is used is compatibility with old software. Software that was built to use UCS-2 internally and UTF-8 externally, and that does not explicitly reject surrogate code points, can be made to store surrogates (for supplementary characters) by feeding its external interfaces with CESU-8 data (which will be converted to UTF-16 for storage internally). Plugwash 23:07, 7 December 2006 (UTC)
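The compatibility argument can be illustrated with a sketch of the kind of decoder such old software might contain. This is an assumption for illustration, not code from any real product: it knows only 1- to 3-byte sequences, produces 16-bit units, and does not reject surrogates. It also skips validation of continuation bytes to stay short.

```python
def decode_to_utf16_units(data):
    """Decode 1- to 3-byte sequences into 16-bit units (hypothetical UCS-2-era decoder)."""
    units = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:                      # 0xxxxxxx: one byte
            units.append(b)
            i += 1
        elif b >> 5 == 0b110:             # 110xxxxx: two bytes
            units.append(((b & 0x1F) << 6) | (data[i + 1] & 0x3F))
            i += 2
        elif b >> 4 == 0b1110:            # 1110xxxx: three bytes
            units.append(((b & 0x0F) << 12)
                         | ((data[i + 1] & 0x3F) << 6)
                         | (data[i + 2] & 0x3F))
            i += 3
        else:
            raise ValueError("unexpected lead byte 0x%02X" % b)
    return units


# CESU-8 bytes for U+10400, fed through the external interface.
units = decode_to_utf16_units(bytes.fromhex("eda081edb080"))
print([hex(u) for u in units])  # ['0xd801', '0xdc00']

# Any UTF-16-aware consumer of the stored units sees the supplementary character.
text = b"".join(u.to_bytes(2, "big") for u in units).decode("utf-16-be")
print(text == "\U00010400")  # True
```

The decoder never needs to know about supplementary characters: it stores two surrogate units, which happen to form a valid UTF-16 pair.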

Comments

The reference to MySQL utf8mb3 is a bit weird, since MySQL utf8mb3 does not support characters outside the BMP, while CESU-8 supports them through UTF-8-encoded surrogate pairs. Bernt (talk) 07:24, 29 August 2013 (UTC)

utf8mb3 "supports" surrogate pairs exactly the same way as CESU-8 (i.e. it turns them into two 3-byte sequences). Whether this is considered support for non-BMP is questionable, but the answer would be the same for both utf8mb3 and CESU-8. Spitzak (talk) 17:11, 31 August 2013 (UTC)
I don't think so. The output below is from MySQL 5.6:
mysql> select hex(convert(_utf32 0x10400 using utf8mb3));
+--------------------------------------------+
| hex(convert(_utf32 0x10400 using utf8mb3)) |
+--------------------------------------------+
| 3F                                         |
+--------------------------------------------+
1 row in set (0.01 sec)

mysql> select hex(convert(_utf16 0xD801DC00 using utf8mb3));
+-----------------------------------------------+
| hex(convert(_utf16 0xD801DC00 using utf8mb3)) |
+-----------------------------------------------+
| 3F                                            |
+-----------------------------------------------+
1 row in set (0.00 sec)

mysql> select hex(convert(_utf32 0x10400 using utf8mb4));
+--------------------------------------------+
| hex(convert(_utf32 0x10400 using utf8mb4)) |
+--------------------------------------------+
| F0909080                                   |
+--------------------------------------------+
1 row in set (0.00 sec)

mysql> select hex(convert(_utf16 0xD801DC00 using utf8mb4));
+-----------------------------------------------+
| hex(convert(_utf16 0xD801DC00 using utf8mb4)) |
+-----------------------------------------------+
| F0909080                                      |
+-----------------------------------------------+
1 row in set (0.00 sec)

Bernt (talk) 14:51, 31 March 2014 (UTC)

Your test is converting to UTF32 internally, then converting UTF32 to utf8mb3/4. You need to trigger the code that goes DIRECTLY from UTF-16 to utf8mb3. Unless the programmers are total masochists, it will not check whether a surrogate half is "paired" before converting it to 3 bytes. Spitzak (talk) 17:00, 16 September 2020 (UTC)
I think I finally get it. CESU-8 is what you get when you convert to UTF-8 from UTF-16 with supplementary characters present, but as though the input is UCS-2. So said test, using UTF-16 character 0xD801DC00 as above, would look like:
mysql> select hex(convert(_ucs2 0xD801DC00 using utf8mb3));
+----------------------------------------------+
| hex(convert(_ucs2 0xD801DC00 using utf8mb3)) |
+----------------------------------------------+
| EDA081EDB080                                 |
+----------------------------------------------+
1 row in set (0.00 sec)

Six bytes. I get it now. What I don't know is what happens when you just try to encode a supplementary plane character directly, without specifying UCS-2; it appears that you get garbage (3F, a "?" placeholder), as the _utf32 example above shows. So, while I now see that utf8mb3 can be used to store CESU-8 encoded data, it seems to me that you'd have to really want to do it, as opposed to it being something that happens organically or automatically. I'm going to poke around at it some more, but I'd still welcome a demonstration where a supplementary character is automatically stored as two 3-byte sequences in utf8mb3 (without specifying UCS-2). Ivanxqz (talk) 10:02, 18 September 2020 (UTC)

This is simply programmer oversight: they failed to write the UTF32->utf8mb3 converter so that UTF16->UTF32->utf8mb3 produces the same result as UTF16->utf8mb3. All of these CESU/utf8mb3/WTF8 variations come from a mindset where "Unicode" means 16-bit code units and is never converted to anything other than UTF-16/UCS-2 or UTF-8; in particular, at no point do they think of a non-BMP character as anything other than 2 surrogate halves. When you think of it that way, CESU and utf8mb3 are identical. The only real difference is that the document for CESU tries to encourage the UTF32<->CESU converter to be written to consume/produce non-BMP characters; as you can see, they did not do this for utf8mb3 in the UTF32->utf8mb3 conversion case. WTF8 is a somewhat more thoughtful attempt to preserve the primary benefit of these encodings, which is that they can store unpaired surrogate halves (necessary to store all possible Windows filenames). It specifies that paired surrogate halves must always be translated to non-BMP characters. Spitzak (talk) 19:23, 18 September 2020 (UTC)
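The WTF-8 rule described above (paired surrogates always become one non-BMP character, unpaired halves are kept as 3-byte sequences) can be sketched in Python as follows. The function name `wtf8_encode` and its input format, a list of 16-bit code units, are assumptions for illustration. This is not a complete or validated implementation of the WTF-8 specification.

```python
def wtf8_encode(units):
    """Encode a list of 16-bit code units, WTF-8 style (hypothetical helper)."""
    out = bytearray()
    i = 0
    while i < len(units):
        u = units[i]
        is_high = 0xD800 <= u <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low:
            # Paired surrogates: combine into one supplementary code point,
            # which encodes as a single 4-byte sequence.
            cp = 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            out += chr(cp).encode("utf-8")
            i += 2
        else:
            # BMP code unit or unpaired surrogate half: 1 to 3 bytes.
            out += chr(u).encode("utf-8", "surrogatepass")
            i += 1
    return bytes(out)


print(wtf8_encode([0xD801, 0xDC00]).hex())  # f0909080 (paired: same as UTF-8)
print(wtf8_encode([0xD801]).hex())          # eda081 (unpaired high half kept)
print(wtf8_encode([0xDC00, 0xD801]).hex())  # edb080eda081 (wrong order: not a pair)
```

Compare the first line with the CESU-8/utf8mb3 result for the same two code units, `EDA081EDB080`: that is the point where the encodings diverge.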