Talk:GB 2312

Proofreading (2011)

"The value of the first byte is from 0xA1-0xF7 (161-247), while the value of the second byte is from 0xA1-0xFE (161-254). Hence, like UTF-8, it is possible to check if a byte is part of a two-byte construct when using EUC-CN."

These two sentences don't make sense to me. How does the second sentence follow from the first?

It's incorrect as far as I can tell. In EUC-CN it's not possible to tell from a byte alone whether it is the tail of a two-byte construct, because the lead and tail byte ranges overlap. With UTF-8 you can, because a tail byte starts with binary 10 while a lead byte starts with binary 11.
"Compared to UTF-8, GB2312 (whether native or encoded in EUC-CN) is also more storage efficient, since Chinese characters are limited to a maximum of two bytes each, while UTF-8 uses at least three bytes."
That line is incorrect as well. UTF-8 has 2048 two-byte sequences. I'll go ahead and fix the article. --Scandum (talk) 00:20, 8 May 2011 (UTC)
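To illustrate the point about tail bytes, here is a minimal sketch in Python. The function names `utf8_byte_kind` and `euc_cn_possible_roles` are made up for this example, and the EUC-CN ranges are the ones quoted from the article above (lead 0xA1-0xF7, tail 0xA1-0xFE).

```python
def utf8_byte_kind(b: int) -> str:
    """Classify one byte purely by its UTF-8 bit pattern."""
    if b < 0x80:
        return "ascii"          # 0xxxxxxx
    if b & 0xC0 == 0x80:
        return "continuation"   # 10xxxxxx, always a tail byte
    # 11xxxxxx: lead-byte pattern (a few of these values never occur
    # in valid UTF-8, but none of them can be a tail byte).
    return "lead"


def euc_cn_possible_roles(b: int) -> set:
    """Return every role a byte could play in EUC-CN, judged in isolation."""
    roles = set()
    if b < 0x80:
        roles.add("ascii")
    if 0xA1 <= b <= 0xF7:
        roles.add("lead")
    if 0xA1 <= b <= 0xFE:
        roles.add("trail")
    return roles


if __name__ == "__main__":
    for b in (0x41, 0xA4, 0xCD, 0xE5, 0xFA):
        print(hex(b), utf8_byte_kind(b), sorted(euc_cn_possible_roles(b)))
```

Under these assumptions, a byte such as 0xCD is always a lead byte in UTF-8, but in EUC-CN it could be either half of a pair, so you have to scan from a known boundary to tell.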
CJK Unified Ideographs (Unicode block) starts at code point U+4E00, well outside the two-byte UTF-8 range, which ends at U+07FF. Always consider the context: GB 2312 is a Chinese encoding. --Artoria2e5 emits crap 13:24, 29 September 2016 (UTC)
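The storage comparison can be checked with a small Python sketch. It assumes Python's built-in `gb2312` codec (which produces EUC-CN bytes); the helper name `encoded_lengths` is made up for this example.

```python
def encoded_lengths(ch: str) -> tuple:
    """Return (UTF-8 length, EUC-CN length) in bytes for one character."""
    return len(ch.encode("utf-8")), len(ch.encode("gb2312"))


if __name__ == "__main__":
    # U+5916 (外) is a CJK Unified Ideograph that GB 2312 contains.
    print(encoded_lengths("\u5916"))

    # The two-byte UTF-8 range ends at U+07FF; U+0800 already needs three bytes.
    print(len("\u07ff".encode("utf-8")), len("\u0800".encode("utf-8")))
```

So both comments hold in their own scope: UTF-8 does have two-byte sequences, but none of them reach the Chinese characters that GB 2312 covers.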

External links modified

Hello fellow Wikipedians,

I have just modified 2 external links on GB 2312. Please take a moment to review my edit. If you have any questions, or need the bot to ignore the links, or the page altogether, please visit this simple FaQ for additional information. I made the following changes:

When you have finished reviewing my changes, you may follow the instructions on the template below to fix any issues with the URLs.

This message was posted before February 2018. After February 2018, "External links modified" talk page sections are no longer generated or monitored by InternetArchiveBot. No special action is required regarding these talk page notices, other than regular verification using the archive tool instructions below. Editors have permission to delete these "External links modified" talk page sections if they want to de-clutter talk pages, but see the RfC before doing mass systematic removals. This message is updated dynamically through the template {{source check}} (last update: 18 January 2022).

  • If you have discovered URLs which were erroneously considered dead by the bot, you can report them with this tool.
  • If you found an error with any archives or the URLs themselves, you can fix them with this tool.

Cheers. --InternetArchiveBot 12:18, 9 October 2017 (UTC)

EUC-CN conversion issues

"To map the code points to bytes, add 158 (0x98) to the row number of the code point to form the high byte, and add 158 column number of the code point to form the low byte. The row number is the code point integer divided by 94, and the column the code point modulo 94.

For example, if you have the GB2312 code point 4566 ("外", which means foreign), the high byte will be 4566/94+158=206=0xCE, and the low byte will come from 4566%94+158=212=0xD4. So, the full encoding is 0xCED4=52948."

This section does not appear to be correct. The example code point 4566 (row 45, column 66; see the character at https://archive.org/details/GB2312-1980/page/n17) is converted to EUC-CN by adding 160 (0xA0) to the row and to the column, giving the two-byte value 0xCDE2 (45 + 160 = 205 = 0xCD, 66 + 160 = 226 = 0xE2). The value currently on the page, 0xCED4, is a different character (卧, code point 4652, row 46, column 52).
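The arithmetic can be sketched in Python as follows. This is only an illustration: the function names are made up, and it assumes the four-digit code point is written as row * 100 + column with both in the range 1-94 (the characters actually assigned in GB 2312 occupy a subset of that grid).

```python
def kuten_to_euc_cn(code_point: int) -> bytes:
    """Convert a four-digit GB 2312 row/column code (e.g. 4566) to EUC-CN bytes."""
    row, col = divmod(code_point, 100)
    if not (1 <= row <= 94 and 1 <= col <= 94):
        raise ValueError(f"row/column out of range: {code_point}")
    # Add 0xA0 (160) to the row for the high byte and to the column for the low byte.
    return bytes([row + 0xA0, col + 0xA0])


def euc_cn_to_kuten(pair: bytes) -> int:
    """Inverse of kuten_to_euc_cn for a two-byte EUC-CN sequence."""
    if len(pair) != 2 or not all(0xA1 <= b <= 0xFE for b in pair):
        raise ValueError("not a two-byte EUC-CN sequence")
    return (pair[0] - 0xA0) * 100 + (pair[1] - 0xA0)


if __name__ == "__main__":
    print(kuten_to_euc_cn(4566).hex().upper())    # row 45, column 66
    print(euc_cn_to_kuten(bytes([0xCE, 0xD4])))   # the value the page had
```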

Both of these values (0xCDE2 and 0xCED4) and the characters they represent can be verified by viewing the Unicode to GB2312 conversion table at https://web.archive.org/web/20160303230643/http://cs.nyu.edu/~yusuke/tools/unicode_to_gb2312_or_gbk_table.html and looking at characters U+5916 (外) and U+5367 (卧) and seeing the values listed underneath each.
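As an independent cross-check that does not rely on the linked table, Python's built-in `gb2312` codec (which emits EUC-CN bytes) can be queried directly. A minimal sketch; the helper name `euc_cn_hex` is made up for this example.

```python
def euc_cn_hex(ch: str) -> str:
    """Return the EUC-CN encoding of one character as uppercase hex."""
    return ch.encode("gb2312").hex().upper()


if __name__ == "__main__":
    for ch in ("\u5916", "\u5367"):   # U+5916 (外) and U+5367 (卧)
        print(f"U+{ord(ch):04X}", euc_cn_hex(ch))
```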

Additionally, the constants given in the current section, 158 and 0x98, are not the same value: decimal 158 is 0x9E, and 0x98 is decimal 152.

It also looks like this section was correct before the edit of 15 December 2016. HalfCap (talk) 23:29, 29 November 2018 (UTC)

I went ahead and made the changes based on the information above. HalfCap (talk) 14:39, 10 December 2018 (UTC)