Talk:Tai Tham (Unicode block)

refs

@RichardW57: I'm trying to understand why the refs in the history section needed a workaround. This format is used in hundreds of Unicode block articles. What problem was the old ref causing in this specific article? Thanks. DRMcCreedy (talk) 03:18, 5 February 2020 (UTC)

If you go back to, for example, the version of 3 February, you will see that what is listed as Reference 3 in the list of references is numbered '4' in the text. They are linked, but the reference number in the text is inconsistent with the number in the list of references.
I suspect there's a bug in the citation numbering component, but I don't have any good hypothesis for what it is. It's possible that mixes of dispersed (<ref>) and collected (<refs>) references don't work together well in the presence of named groups; it might even be something specific to the #ref mechanism used by {{note}}.
Do any other of the 'hundreds of Unicode block articles' contain references in less mechanical* sections? I couldn't find any examples of such sections, which made me worry whether the additional sections belong in this article. I was torn between this article and Tai Tham script. I didn't like the idea of a new article 'Tai Tham (encoding)', though it will probably be needed if the script expands to a second block. I've translated and partially adapted them from the Northern Thai Wikipedia incubator article whose name transliterates as 'Tua Mueang' for the benefit of those who don't read Northern Thai. That article functions as the equivalent of this article and Tai Tham script.
Arguably, I should try to convert the new text to using <refs>.
*I appreciate that, despite appearances, the history section is not just handle turning.
-- RichardW57m (talk) 11:17, 5 February 2020 (UTC)
I've cut this file down in a sandbox to demonstrate the problem. The problem is triggered by having both refs and the 'note' argument to {{Infobox Unicode block}}. -- RichardW57m (talk) 15:45, 5 February 2020 (UTC)
Thanks, I now understand the problem. I've played around with various fixes and think that using explanatory footnotes (Template:efn and notelist) is probably the best approach. DRMcCreedy (talk) 19:05, 5 February 2020 (UTC)
I've found an example where I would expect the problem to occur but it doesn't - Latin-1_Supplement_(Unicode_block). None of my guesses explain why that page works with the refs parameter but this page (still) doesn't. --RichardW57m (talk) 13:34, 6 February 2020 (UTC)

Final Code Points

Would it be appropriate for me to split the 'final code points' column into distinct rows so as to record the code points affected by each document?

One of the documents (N3384) changed the interpretation of <U+1A60 SAKOT, U+1A37 BA>. As it affects the coding map from base consonants to subscript consonant forms, is it appropriate to say that this change affects U+1A37, and not the control-type character U+1A60? I similarly intend to record the disunification (in the same document) of proposed <U+1A60, U+1A48 HIGH SA> as affecting U+1A48 (and leading to U+1A5E SIGN SA). -- RichardW57m (talk) 11:17, 5 February 2020 (UTC)Reply

I would only split final code point ranges if the code points were initially proposed in separate mature proposals. And even then it's a gray area. (Emoji are the most complicated in this matter.) The code point column isn't saying that each specific code point is discussed in each of the listed documents. Going into that level of detail would take too much effort and would cause a huge number of duplicate documents being listed in the table. That being said, N3384 seems to be the first proposal for U+1A5D TAI THAM CONSONANT SIGN BA and U+1A5E TAI THAM CONSONANT SIGN SA. So I'd support splitting U+1A5D-E out if you think that would be useful. DRMcCreedy (talk) 19:52, 5 February 2020 (UTC)Reply
I've reverted the edit because the same code point (like U+1A20) can't show up on multiple rows. In the hundreds of Unicode block histories, once a reader finds the desired code point(s) they can stop searching. Having the same code point(s) on multiple rows creates a new system that is incompatible with the existing charts. However, documents can show up multiple times if they cover code points that were proposed in separate mature proposals. For example, if you wanted to split U+1A5D-E out, N3453 would appear once for "U+1A20..1A5C, 1A60..1A7C, 1A7F..1A89, 1A90..1A99, 1AA0..1AAD" and again for "U+1A5D..1A5E". But I wouldn't chop up (repeat) the main proposal just to get down to a very granular, code point by code point, analysis. DRMcCreedy (talk) 04:42, 6 February 2020 (UTC)Reply
So how does one handle set A being added in 6.0, set B being added in 7.0, and then point x from set A having its properties changed (as opposed to added to) in 9.0? I strongly suspect we'll find examples of exactly that in Malayalam. Do we end up with a table such as the following:
6.0 A add_A
9.0 change_x
7.0 B add_B
No, the format of the table records the version in which the code point(s) were added, not the date of each change or discussion, so the format is as below:
6.0 A add_A
change_x
7.0 B add_B
DRMcCreedy (talk) 23:56, 6 February 2020 (UTC)
For example, the last 7 rows of the Tai Tham history have nothing to do with Unicode 5.2 - they're modifying or clarifying a property that didn't exist then. Incidentally, the last 4 L2/08 entries are out of order. The logical order, as demonstrated by their content, is L2/08-037[R2], L2/08-073, L2/08-003, L2/08-318. I had been planning to put them in order this morning.
I'm not sure how you're proposing to handle avowedly immature proposals, such as L2/99-245. That's missing the logograms, punctuation, local tone marks, digits and at least 3 base characters. It belongs only to history; it does nothing to explain what the Unicode characters are. (TUS often fails in that respect as well - putting a base character in a chart doesn't always define it, as has been seen with font implementations of <SAKOT, BA> and <SAKOT, PA>.) -- RichardW57 (talk) 08:47, 6 February 2020 (UTC)
My intent was to say that I wouldn't split ranges just because some characters were not in the initial (not yet mature) proposals. DRMcCreedy (talk) 23:56, 6 February 2020 (UTC)
Richard, I think it's really outside the scope of Wikipedia to try to trace the history of individual characters document by document. I understand that you have a particular interest in Tai Tham, and want to make this article as complete as possible, but we should not be adding a level of detail here that we as Wikipedians do not consider to be appropriate for all 300+ block articles. Personally, I think that the History tables already go into too much detail by listing peripheral documents such as UTC/WG2 minutes and dispositions of comments etc., and I would not support adding details of what characters are addressed in each individual document. Sure, that information is useful to serious historians of character encoding (probably not many more than the three of us!), but it is way beyond what should be included in a Wikipedia article, which should aim to provide a general overview to the general reader. BabelStone (talk) 09:36, 6 February 2020 (UTC)
I think the scope of the history is an issue. If we take the last three entries, the question is whether we should record the history of a comment in IndicPositionCategory.txt on the placement of U+1A69 and U+1A6A. While it has been hoped that the concept in Unicode was intuitive, the idea can come unstuck. Patrick Chew has put forward the notion that the glyphs of RA, <SAKOT, RA>, RA HAAM (as a consonant) and perhaps MEDIAL RA are all the same character. Indeed, the last two could conceivably have been implemented as RA combined with a control character or similar; it was decided that it wasn't worth floating that idea before the UTC. (Kourilsky had two viramas in his system.) Now, the fact of variability in placement belongs in a description of the writing system. The claim that a spacing right matra does not need separate encoding from a matra below merits recording; it is a Unicode decision, not a feature of the writing system, and is not a totally obvious decision.
Some of what you're talking about seems to veer into original research, which isn't the purpose of Wikipedia. DRMcCreedy (talk) 23:56, 6 February 2020 (UTC)
Non-obvious claims are candidates for recording if the UTC accepts them and documents them. --RichardW57 (talk) 00:37, 7 February 2020 (UTC)
So, if someone wants to confirm that a right matra for /u/ is encoded as U+1A69 despite its UCD properties, where should they turn? I wouldn't trust the comment to remain in the UCD - the XML file for Unicode properties does not contain it. If one wants to know about U+1A20 KA as a Unicode character, scanning the 'final code points' column with the level of detail I had would tell the reader that the last 7 lines held nothing of interest.
The Unicode Standard itself should be the ultimate reference. If there's a vital piece of information missing it should be addressed there. DRMcCreedy (talk) 23:56, 6 February 2020 (UTC)
I have noticed that the table seems designed for recording initial complements and additions to a block. We may need some stock footnotes such as 'property change' to adorn the code points in the list. Another one that may be appropriate is 'disunification'. It's a shame the bug with the 'refs' parameter doesn't allow us to populate these in advance. The example that springs to my mind is the introduction of Malayalam chillus. A Latin-1 example, the disunification of LATIN SMALL LETTER VISIGOTHIC Z from LATIN SMALL LETTER C WITH CEDILLA, also comes to mind, though I notice that this hasn't been recorded as a change to the Latin-1 Supplement block.
Generally I've only included changes that affect the code charts or the Unicode Character Database (UCD), although often property changes didn't pose a big enough change to document. People should be using the UCD directly for things like properties. DRMcCreedy (talk) 23:56, 6 February 2020 (UTC)
Character boundaries are also significant, and won't necessarily show up with a single code chart. Changes in the way of encoding a user-perceived character also matter. And how do you propose readers get the reason for a property change out of the UCD? Finding them is hard enough. --RichardW57 (talk) 00:37, 7 February 2020 (UTC)
One trick I saw in that block is to add the characters affected to the title of the referenced document. It's helpful, but naughty. --RichardW57m (talk) 13:20, 6 February 2020 (UTC)
I have resorted to adding the code point(s) to the document title but I've tried to avoid it... usually I did it for the CJK blocks where the ranges are huge. DRMcCreedy (talk) 23:56, 6 February 2020 (UTC)

N2042 Position

I propose putting this in the date position of Unicode Technical Report #3, of which it is essentially an extract. --RichardW57m (talk) 14:41, 6 February 2020 (UTC)

I don't understand this comment. Put what exactly where? Are you referencing the date parm of the citation template? DRMcCreedy (talk) 23:56, 6 February 2020 (UTC)
N2042 is an extract from Unicode Technical Report #3 Revision 1, which is earlier than any other document in the history, and its relative immaturity shows. It's interesting in what it shows as apparently user-perceived characters, but is manifestly unfit as a proposal. For Tai Tham, it's a step back from the two Chinese proposals you list at the start. As UTR#3 Revision 1 is an earlier document, I suggest making N2042 the first document in the list. --RichardW57 (talk) 00:15, 7 February 2020 (UTC)
Done. --RichardW57m (talk) 10:09, 10 February 2020 (UTC)

Medial la and sakot + la

I'd like to know which information is correct. The following sentences on this page cite information that is not actually in the reference.

  • "ᩆᩦ᩠ᩃ (IPA: [siːn]) is encoded as <U+1A46 HIGH SHA, U+1A66 SIGN II, U+1A60 SAKOT, U+1A43 LA> but ᨸᩖᩦ (IPA: [piː]) is encoded as <U+1A38 ᨸ, U+1A56 MEDIAL LA, U+1A66 SIGN II>.[3](Section 4)."
  • "ᨸᩖ᩠ᨿ᩵ᩁ is actually encoded <U+1A38 HIGH PA, U+1A56 MEDIAL LA, U+1A60 SAKOT, U+1A3F LOW YA, U+1A75 TONE-1, U+1A41 RA>[3](Section 14.9)" (bold added).
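For anyone wanting to check such sequences directly, the following is a minimal sketch in Python, assuming only the standard `unicodedata` module. The helper name `describe` and the variable names are illustrative and not taken from the article or the proposal. It lists the code points and formal character names for the two sequences quoted in the first bullet, written with escapes so that the result does not depend on how the Tai Tham text was typed or pasted.

```python
import unicodedata

def describe(text):
    """Return (code point, character name) pairs for each character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in text]

# Sequences as quoted in the first bullet, written as escapes.
coda_la = "\u1A46\u1A66\u1A60\u1A43"   # HIGH SHA, SIGN II, SAKOT, LA
medial_la = "\u1A38\u1A56\u1A66"       # HIGH PA, MEDIAL LA, SIGN II

for label, seq in (("coda la", coda_la), ("medial la", medial_la)):
    print(label)
    for code_point, name in describe(seq):
        print("  ", code_point, name)
```

Running `describe` on text copied from a page is a quick way to see whether a spelling uses U+1A56 MEDIAL LA or the pair U+1A60 SAKOT, U+1A43 LA.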

In Section 4, the source discusses the use of the medial la in kla. It doesn't recommend that users use SAKOT + LA for coda l's and MEDIAL LA for l's in clusters. Section 14.9 actually says "For example, /plian/, which combines 14.2 and 14.3: ... = PA + SAKOT + LA + SAKOT + YAL + TONE-1 + RA," suggesting that SAKOT + LA be used (bold added).

The reference in the first sentence applies to the second clause. I think the syllable (let alone word) kla does not actually exist in Northern Thai - it isn't in the MFL or NTDPLM. Perhaps I should reference the first clause to Section 14.5 of the proposal. The Tai Tham text in the second example is wrong. I will have to fix it, or find a better example. --RichardW57 (talk) 09:47, 30 July 2020 (UTC)
Now fixed on this page. And on the original of these notes on the encoding, at nod:ᨲ᩠ᩅᩫᨾᩮᩬᩥᨦ#ᩀᩪᨶᩥᨣᩰᩫ᩠ᨯ. --RichardW57 (talk) 21:48, 30 July 2020 (UTC)Reply
It is not the part of a Unicode proposal to propose visual changes to the orthography of languages. There are some rare instances of MEDIAL LA being used for coda 'l'. Possibly it is because the words in question are visibly Pali, and the author was generalising from the usual form of the subscript in Pali being MEDIAL LA. <SAKOT, LA> for etymological (sometimes non-etymological!) medial /l/ is quite common - some people seem not to use MEDIAL LA at all. --RichardW57 (talk) 09:47, 30 July 2020 (UTC)

Also, reading the source doesn't help me with this because in Section 15, SAKOT + LA is shown in examples with onset ha (e.g., Examples #22 and 28, เหลือ and หลาย in Thai respectively). Richard's proposal implies that the medial la (which I know is argued to be called something else) is used in words like banana (kluay). So, what is the recommendation? It makes sense to me to always use the medial la instead of SAKOT + LA because it is one character fewer. The answer to this question will help us update the information on this page, and it will also help me create Northern Thai entries in Wiktionary. @RichardW57m and RichardW57: your input would be helpful. @Octahedron80: please join the discussion. Thank you! --A.S. (talk) 06:10, 30 July 2020 (UTC)Reply

I think both medial L and sakot L are correct, because I found both forms for the same word (as medial). As for H-L (ᩉᩖ and ᩉ᩠ᩃ), both are used. But I choose medial L as the main entry in Wiktionary; the other form will be an alternative form. --Octra Bond (talk) 06:18, 30 July 2020 (UTC)

@Octahedron80 and Alifshinobi: Some people think that 'ᩉ᩠ᩃ' is a bad spelling, like seperate in English. However, even the MFL makes distinctive use of ᨸᩖ and ᨸ᩠ᩃ. It writes ᨸ᩠ᩃᩢᨯ to indicate a svarabhakti vowel - or is it truly two syllables, as indicated by its saying อ่าน"ปะหลัด"? Some people seem to use MEDIAL LA as a silent letter, and <SAKOT, LA> as a sounded letter. (The combination 'll' from Pali may have its own rules.) As far as encoding is concerned, MEDIAL LA and <SAKOT, LA> have distinct families of glyphs, though if one goes far enough back one may find intermediate, ambiguous glyphs. Therefore, what one records in Wiktionary must agree with what one finds in the source material. 'ᩉ᩠ᩃ' is unobjectionable as a sequence of Unicode characters, just as 'seperate' is. It may be under-represented at Wiktionary, because some dictionaries won't include it and therefore one should provide quotations for it if some awkward bugger refuses to accept it as 'obvious'. --RichardW57 (talk) 09:47, 30 July 2020 (UTC)Reply
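The point that MEDIAL LA and <SAKOT, LA> are distinct at the encoding level can be illustrated with a small sketch, again assuming only Python's standard `unicodedata` module. The function name `same_after_normalization` and the variable names are hypothetical. It builds the two H-L spellings from escapes and checks whether any of the four Unicode normalization forms makes them compare equal.

```python
import unicodedata

HIGH_HA, LA, MEDIAL_LA, SAKOT = "\u1A49", "\u1A43", "\u1A56", "\u1A60"

with_medial = HIGH_HA + MEDIAL_LA      # two code points
with_sakot = HIGH_HA + SAKOT + LA      # three code points

def same_after_normalization(a, b):
    """True if a and b compare equal under any standard normalization form."""
    forms = ("NFC", "NFD", "NFKC", "NFKD")
    return any(
        unicodedata.normalize(form, a) == unicodedata.normalize(form, b)
        for form in forms
    )

print(len(with_medial), len(with_sakot))
print(same_after_normalization(with_medial, with_sakot))
```

Because normalization does not merge the two spellings, a search or a Wiktionary lookup for one will not find the other unless both are entered or a redirect is provided, which fits the advice to record whichever form the source material actually shows.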
Going to what I think of as the rustic alternative, the banana word is spelt ᨠᩖ᩠ᩅ᩠᩶ᨿ in the MFL (p16 of Revision 1) and ᨠ᩠ᩃ᩠ᩅ᩠᩶ᨿ on p20 of คู่มือการเรียนหนังสือตัวเมือง(แบบง่าย) by Sa-ngop Dechawong (foreword dated 2001; I have a copy from the 6th impression, dated 2005). --RichardW57 (talk) 14:05, 30 July 2020 (UTC)Reply
@RichardW57 and Octahedron80: Thank you! ᨿᩥ᩠ᨶᨯᩦᨩᩣ᩠ᨲᨶᩢ᩠ᨠᨤᩢ᩠ᨷ! --A.S. (talk) 17:08, 30 July 2020 (UTC)Reply