Talk:Marshallese language/Archives/2020/May

IPA overhaul, again?

I had worked hard to overhaul Marshallese IPA into a convention that is not only eminently readable and matches the regular patterns in the orthography, but could also be generated algorithmically in a Scribunto/Lua module. But the Marshallese Reference Grammar (MRG) makes even this no longer certain. Since then, I've given this new knowledge some thought in how to craft better IPA vowels. Here are three options, using eaktuwe and ākūtwe as examples. You should see very quickly the kinds of different approaches I'm showcasing.

Option | eaktuwe {yaktiwey}, normal | eaktuwe {yaktiwey}, enunciated | ākūtwe {yakitwey}, normal | ākūtwe {yakitwey}, enunciated | Notes
#1 | æɡɯ̯dˠuwɛ | ɛ̯ɑk tˠuwɛ | æɡɯdˠu̯wɛ | æɡɯtˠ wɛ | Current module output. Painstakingly crafted to match orthographical patterns.
#2 | æ͜ɑɡɯ̯dɯ͜uwɔ͜ʌ͜ɛ | æ͜ɑk tɯ͜uwɔ͜ʌ͜ɛ | æ͜ɑɡɯdɯ̯͜u̯wɔ͜ʌ͜ɛ | æ͜ɑɡɯt wɔ͜ʌ͜ɛ | Full diphthong mode, similar to the previous way this article handled IPA. Accurate, but hard for readers to digest.
#3 | æ̯ɐɡɯ̯du̜wʌɛ̯ | æ̯ɐk tu̜wʌɛ̯ | æ̯ɐɡɯdu̜̯wʌɛ̯ | æ̯ɐɡɯt wʌɛ̯ | An alternative approach, using semivowels to mark the glides and the vowel midpoints of the diphthongs.

It's inescapable that option #1 remains the most readable. It is also...arbitrary, only because the orthography's norms are arbitrary, selected by committee to preserve literary tradition and continuity in a relatively more phonologically regular manner, rather than directly reflecting pronunciation itself. It's a good concise orthography, which is important to teach and promote reading and writing, and in that sense an IPA that is similar to the orthography is not the worst idea. I think the best approach may end up being somewhere between option #1 and option #3. The fewer diphthongs the reader has to digest, the better. And yet I must admit there's a certain elegance in a phonemic palindrome like Wōtto {wettew} being phonetically transcribed as [wɔ̜ttˠɔ̜w] rather than [wʌttˠɔ], with the vowel [ɔ̜] being the diphthongal midpoint between [ʌ] and [ɔ]. At least, in the short term, option #1 does no real harm, per se, as an imitation of it still produces intelligible Marshallese. ...So, I'm divided. I'm sure, at least, that I can improve the module's current algorithm with corrections from the MRG for dialect reflexes and non-initial realizations of {yi'y}, and yet I'm still tempted to write whole new functions from scratch. (Is it obvious by now that I'm OCD?) - Gilgamesh (talk) 19:21, 26 January 2020 (UTC)


Okay... Thinking about it, option #1 isn't exactly that bad, but it can be modified to be more regular, and not rely on byzantine vowel reflex tables. It can use some simpler rules of thumb, taking only into account whether a consonant is a glide or non-glide, what kind of glide it is (sometimes), the F1 of the vowels, and whether a VGV sequence is interpreted as a long monophthong. Some ideas I have:

  • Make phonemically mirrored word pairs (like eo̧ {yaw} and wā {way}) and phonemic palindromes (like Wōtto {wettew}) have identical or near-identical IPA in reverse, as much as is practical. There will be unavoidable exceptions, mainly involving consonant cluster assimilations, the F2 of glide-glide epenthetic vowels, and regressive vowel harmony assimilation of mid vs. mid-high vowels in VGV sequences.
  • Let any velar glide phoneme {h} have unconditional F2 priority over vowels on either side, as the module already does.
  • At the beginning or ending of a word with {w}, always use [w] unless the neighboring vowel is a true monophthongal [ɒ]. This generalizes what the module already does when a word starts with {w} to positions not previously covered.
  • If a vowel is low and needs a neighboring palatal semivowel, use [æ̯], not [ɛ̯], as {ya} spellings with ⟨e⟩, like ea and eo̧, are a committee-informed orthographic convention, not a phonetic one. (I imagine that retaining ea, eo̧ made them easier to read at a quick glance than *āa, *āo̧ proved to be.)
  • In G-G clusters, the epenthetic vowel always takes the F2 of the glide to the left, as the module already does.
  • In /CʲVCʷ, CʷVCʲ/ patterns, unless the vowel is a member of a long monophthong, always use a back unrounded vowel reflex.
    • eo̧ {yaw} as [æ̯ɑw], not [ɛ̯ɒ].
    • wā {way} as [wɑæ̯], not [wæ].
    • Nuwio̧o̧k {niwiyawak} as [nʲɯwɯjɒːk] or possibly [nʲuːjɒːk], not [nʲuwiɒːk].
  • In CVC patterns without glides where either consonant is velarized, also always use a back unrounded vowel reflex.
    • dik {dik} 'small' as [rʲɯk], not [rʲik].
    • kil {kil} 'skin' as [kɯlʲ], not [kilʲ].
  • In CVG and GVC sequences where one of the consonants is velarized and one consonant is a glide and the other is not a glide, always use a vowel F2 corresponding to the glide, except (for contrast's sake) if the glide is {w}, in which case always use a back unrounded vowel paired with a surfaced [w].
    • ean̄ {yag} 'north' as [æŋ], not [æ̯ɑŋ] or [ɛ̯ɑŋ].
    • wōt {wȩt} 'rain' as [wɤt], not [wot].
  • Mark identified long monophthongs as long monophthongs, even if the orthography obscures this.
    • kweet {kʷȩyȩt} 'octopus' as [kʷeːtˠ], not [kʷɤetˠ].
    • kiur {kiyirʷ} 'storm' as [kiːrʷ], not [kiurʷ].
    • kiurin {kiyirʷin} 'storm of' as [kiːrʷɯnʲ]. It's worth noting that the orthography rules would regularly spell this sequence as kiirun anyway if the word were not analyzed as the suffixed kiur-in.
    • bweo {bȩyȩw} 'pile of husks near husking stick' as [pˠeːw], not [pˠeo] or [pˠeow].
    • bweoan {bȩyȩwan} 'pile of husks near husking stick, of' as [pˠɛːwɑnʲ]. Again, the orthography rules would regularly spell this bweewan if not for the way it was grammatically analyzed.
    • awa {hawah} 'hour' as [ɑwɑ], not [ɑ̯ɒːɑ̯], because {h} is still given absolute priority over vowel F2.
  • Velarized labial and rounded consonant off-glides are indeed purely orthographic, so there is no need to mark them with [w] after a consonant reflex.
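
The two CVC rules of thumb above (no glides involved) can be sketched roughly as follows. This is only an illustration, not the module's code: the function names and the labels "pal", "vel" and "rnd" are hypothetical, and the front reflex between two palatalized consonants and the round reflex between two rounded consonants are assumptions that the list does not state outright.

```python
# Secondary articulation labels (hypothetical): "pal" = palatalized,
# "vel" = unrounded velarized, "rnd" = rounded velarized.

# Reflexes by height and F2 class, for the close and mid-close heights only.
REFLEXES = {
    "close": {"front": "i", "back unrounded": "ɯ", "round": "u"},
    "mid-close": {"front": "e", "back unrounded": "ɤ", "round": "o"},
}


def vowel_f2_class(left, right):
    """Pick the F2 class of a vowel between two non-glide consonants."""
    pair = {left, right}
    if pair == {"pal"}:
        return "front"           # assumption: not stated in the list
    if pair == {"rnd"}:
        return "round"           # assumption: not stated in the list
    # Either consonant velarized, or a palatalized/rounded mix:
    # always a back unrounded reflex.
    return "back unrounded"


def vowel_reflex(height, left, right):
    """Return the IPA vowel for a height between two non-glide consonants."""
    return REFLEXES[height][vowel_f2_class(left, right)]


# dik {dik}: palatalized d, velarized k, close vowel
print(vowel_reflex("close", "pal", "vel"))
```

The glide cases (CVG, GVC, VGV) are deliberately left out, since they depend on which glide is involved and on whether a long monophthong is recognized.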

These are just some tentative ideas for now. - Gilgamesh (talk) 21:53, 26 January 2020 (UTC)


The more I think about this, the more I think maybe long monophthongs needn't be uniquely detected. From what I understand from reading Choi (1992), the long vowel contour (where the nucleus is held for proportionally longer) is not always predictable. kweet {kʷȩyȩt} 'octopus' as [kʷɤetˠ] may be an adequate transcription after all. - Gilgamesh (talk) 19:28, 27 January 2020 (UTC)


So, I was tentatively thinking:

↓→ | Vj | Vɰ | Vw | VCʲ | VCˠ | VCʷ
jV | e | (e̯)ɤ | (e̯)ɤ(w) | eCʲ | eCˠ | (e̯)ɤCʷ
ɰV | ɤ(e̯) | ɤ | ɤ(w) | ɤCʲ | ɤCˠ | ɤCʷ
wV | (w)ɤ(e̯) | (w)ɤ | o | (w)ɤCʲ | oCˠ | oCʷ
CʲV | Cʲe | Cʲɤ | Cʲɤ(w) | CʲeCʲ | CʲɤCˠ | CʲɤCʷ
CˠV | Cˠe | Cˠɤ | Cˠo | CˠɤCʲ | CˠɤCˠ | CˠɤCʷ
CʷV | Cʷɤ(e̯) | Cʷɤ | Cʷo | CʷɤCʲ | CʷɤCˠ | CʷoCʷ

Equivalent IPA symbols apply to the other vowels, adjusted for vowel height. Symbols in parentheses are omitted if a vowel symbol of the appropriate F2 is adjacent, except in certain situations where F1 differs at a glide boundary, like in debweiu {debȩyiw} [rʲʌbˠejɯw], not [rʲʌbˠeɯw].

This vowel transcription system is radically regular, with the same vowel and semivowel sequences when phonemes are reversed, and recognizing, as per MRG, that [e͡o, o͡e] are actually short triphthongs in all positions. It also heavily relies on [ɤ] as a neutral vowel in many situations where secondary articulations differ, making much less frequent use of [e] or [o], and helping emphasize the vertical vowel system nature of the phonology in a friendlier way than purely phonemic transcription. It treats /w/ as [w] in all surfaceable positions, recognizing how little it actually varies when surfaced, and treats /j/ as a semivowel of appropriate F1 height, recognizing how much it does vary when surfaced.
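
The claim that reversed phoneme sequences give reversed vowel and semivowel sequences can be checked mechanically. The sketch below is an illustration only: it copies the mid-close table with the consonant symbols stripped from each cell, and it assumes the table has six columns, with a Vɰ column between Vj and Vw (every row has six cells). The names CONTEXTS, GRID, cell and is_mirror_symmetric are hypothetical.

```python
import re

# Contexts in table order: what precedes the vowel (rows) and what follows (columns).
CONTEXTS = ["j", "ɰ", "w", "Cʲ", "Cˠ", "Cʷ"]

# One row per preceding context; parentheses mark omissible semivowels.
GRID = [
    ["e", "(e̯)ɤ", "(e̯)ɤ(w)", "e", "e", "(e̯)ɤ"],
    ["ɤ(e̯)", "ɤ", "ɤ(w)", "ɤ", "ɤ", "ɤ"],
    ["(w)ɤ(e̯)", "(w)ɤ", "o", "(w)ɤ", "o", "o"],
    ["e", "ɤ", "ɤ(w)", "e", "ɤ", "ɤ"],
    ["e", "ɤ", "o", "ɤ", "ɤ", "ɤ"],
    ["ɤ(e̯)", "ɤ", "o", "ɤ", "ɤ", "o"],
]

# Optional "(semivowel)", one vowel character, optional "(semivowel)".
CELL = re.compile(r"(?:\((.+?)\))?(.)(?:\((.+?)\))?")


def cell(before, after):
    """Return (left semivowel, vowel, right semivowel) for one table cell."""
    text = GRID[CONTEXTS.index(before)][CONTEXTS.index(after)]
    match = CELL.fullmatch(text)
    return match.group(1), match.group(2), match.group(3)


def is_mirror_symmetric():
    """True if swapping the two contexts always reverses the cell."""
    return all(
        cell(a, b) == cell(b, a)[::-1]
        for a in CONTEXTS
        for b in CONTEXTS
    )


print(is_mirror_symmetric())
```

Under that six-column reading, every cell is the mirror image of its transposed counterpart, which is what the palindrome requirement asks for.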

One drawback is that the triphthongal approach effectively obliterates the smooth notation of certain long monophthongs, rendering a name like Jo̧o̧n {jawan} 'John' as [tʲɑwɑnʲ] (or perhaps [tʲɑɒ̯ɑnʲ]) instead of [tʲɒɒnʲ] (or [tʲɒːnʲ]).

If this idea is too regular (deviating from orthographical similarity in too many ways), there could still be other ways to adjust it. - Gilgamesh (talk) 02:58, 30 January 2020 (UTC)


Now that I've pondered on my last comment, that's some of the most arbitrary OR, isn't it? I honestly don't know which approach is most wise. - Gilgamesh (talk) 06:47, 30 January 2020 (UTC)


Let's try this again, with more practicality, less ambition. Something somewhat more resembling the orthography-informed module algorithm, but with equivalent rules for all vowel heights, and without different rules for different obstruent primary articulations, since those areas can be simplified rather trivially.

↓→ eCʲ eCˠ eCʷ ej ej(e) eɰ(e) ew ew(e)
Cʲe CʲeCʲ CʲɤCˠ CʲɤCʷ Cʲe Cʲɤ Cʲɤ(w) Cʲo
Cˠe CˠɤCʲ CˠɤCˠ CˠɤCʷ Cˠe Cˠɤ Cˠɤ(w) Cˠo
Cʷe CʷɤCʲ CʷɤCˠ CʷoCʷ Cʷɤ(e̯) Cʷe Cʷɤ Cʷo
je eCʲ eCˠ (e̯)ɤCʷ e (e̯)ɤ (e̯)ɤ(w) (e̯)o
(e)je eCʷ e(w)
ɰe ɤCʲ ɤCˠ ɤCʷ ɤ(e̯) ɤ ɤ(w)
(e)ɰe
we (w)ɤCʲ (w)ɤCˠ oCʷ (w)ɤ(e̯) (w)e (w)ɤ o
(e)we oCʲ oCˠ o(e̯)

This still requires logic to detect long monophthongs, but...c'est la vie. The two greyed out cells are a reminder of the non-triviality of determining long monophthongs when two conflicting candidate spans overlap. Take the theoretical sequence {yȩyȩwȩw}: Does the long monophthong exist in {yȩyȩw(ȩw)} [eːo], in {(yȩ)yȩwȩw} [eoː], or in neither [eɤo]? The module's algorithm currently favors whichever sequence comes last, and this may actually still be most correct. But the issue with long monophthong detection is whether the speaker associates and treats it as a long monophthong, which is apparently not phonemic enough that Bender saw fit to indicate it in his phonemic orthography, and yet it's still semantic enough that it's reflected in the orthography. Another complication from Choi (1992) is that even some superficially clear-cut examples of long monophthongs, like naaj {nahaj}, are apparently not as often given the F2 curve associated with true long monophthongs, and instead behave more like two short trajectories: [nʲæ͡ɑɑ͡ætʲ] rather than [nʲ(æ̯)ɑː(æ̯)tʲ]. I can only conclude that there is no reliable way whatsoever to detect long monophthongs purely by phonemic code and algorithm. - Gilgamesh (talk) 19:45, 30 January 2020 (UTC)
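
To make the overlap problem concrete, here is a minimal sketch of a "favor whichever span comes last" policy. It is not the module's actual code. It assumes single-character phoneme tokens, that "&" stands for {ȩ} in the ASCII input, and that a candidate span is any VGV window with the same vowel on both sides; the function names are hypothetical.

```python
VOWELS = {"a", "e", "&", "i"}   # "&" for {ȩ} is an assumption
GLIDES = {"h", "w", "y"}


def candidate_spans(tokens):
    """Return (start, end) index pairs of every VGV window with matching vowels."""
    spans = []
    for i in range(len(tokens) - 2):
        first, glide, second = tokens[i:i + 3]
        if first in VOWELS and glide in GLIDES and first == second:
            spans.append((i, i + 2))
    return spans


def favor_last(spans):
    """Resolve overlaps by keeping later spans and dropping earlier conflicting ones."""
    chosen = []
    for start, end in sorted(spans, reverse=True):
        # Keep this span only if it shares no token with a span already kept.
        if all(end < kept_start or start > kept_end
               for kept_start, kept_end in chosen):
            chosen.append((start, end))
    return sorted(chosen)


tokens = list("y&y&w&w")          # the theoretical {yȩyȩwȩw}
print(candidate_spans(tokens))    # both overlapping candidates
print(favor_last(candidate_spans(tokens)))
```

For the theoretical sequence this keeps only the second window, corresponding to the {(yȩ)yȩwȩw} reading. It says nothing about whether a speaker actually treats that span as a long monophthong, which is the part no algorithm can decide.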


Sorry, Dr. Bender, but I may need to modify just a bit of your work to make this algorithm work.

Right now, the input code is an ASCII-based variant of Bender's orthography. It supports sequences including a b d e & h i j k kw l lh lw m mh n ng ngw nh nw p r rw t w y yi'y 'yiy, with other apostrophes as disambiguators. With the addition of three more code tokens, (h) (w) (y) (or something similar), the nuclei of long monophthongs can semantically be marked as such in a way that the algorithm can detect, and distinguish VGV sequences as curved F2 trajectories rather than as generic angular F2 trajectories. And yet...editors would sometimes have to use their informed judgment and make a semantic determination which kind of trajectory is involved. And any guess made entirely by an editor is...OR.
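
For illustration, the token inventory listed above can be split with a greedy longest-match pass. This is a sketch under assumptions, not the module's parser: the function name is hypothetical, and treating any leftover apostrophe as a pure separator that is dropped is a guess at how the disambiguators behave. The proposed (h) (w) (y) tokens are not included.

```python
# Token inventory as listed in the comment above.
TOKENS = [
    "a", "b", "d", "e", "&", "h", "i", "j", "k", "kw", "l", "lh", "lw",
    "m", "mh", "n", "ng", "ngw", "nh", "nw", "p", "r", "rw", "t", "w", "y",
    "yi'y", "'yiy",
]

# Try longer tokens first so that "ngw" wins over "ng" and "n".
BY_LENGTH = sorted(TOKENS, key=len, reverse=True)


def tokenize(code):
    """Split an ASCII phonemic string into tokens by greedy longest match."""
    result = []
    pos = 0
    while pos < len(code):
        for token in BY_LENGTH:
            if code.startswith(token, pos):
                result.append(token)
                pos += len(token)
                break
        else:
            if code[pos] == "'":
                pos += 1          # assumed: a bare apostrophe only separates tokens
            else:
                raise ValueError(f"unexpected character {code[pos]!r} at {pos}")
    return result


print(tokenize("yaktiwey"))
print(tokenize("nh"), tokenize("n'h"))
```

With this reading, "nh" is one token while "n'h" is two, which is the kind of distinction the disambiguating apostrophes would need to make.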

So...I potentially solve one problem...to reveal another. Maybe it's no wonder Dr. Bender decided not to mark this distinction in his phonemic orthography at all. - Gilgamesh (talk) 21:49, 30 January 2020 (UTC)


Of course, if I were really tempted...

↓→ eCʲ eCˠ eCʷ ej ej(e) eɰ(e) ew ew(e)
Cʲe CʲeCʲ CʲɘCˠ CʲɤCʷ Cʲe Cʲɤ Cʲɤ(w) Cʲo
Cˠe CˠɘCʲ CˠɤCˠ Cˠo̜Cʷ Cˠɘe̯ Cˠe Cˠɤ Cˠo̜(w) Cˠo
Cʷe CʷɤCʲ Cʷo̜Cˠ CʷoCʷ Cʷɤ(e̯) Cʷe Cʷɤ Cʷo
je eCʲ (e̯)ɘCˠ (e̯)ɤCʷ e (e̯)ɤ (e̯)ɤ(w) (e̯)o
(e)je eCˠ eCʷ e(w)
ɰe ɤCʲ ɤCˠ ɤCʷ ɤ(e̯) ɤ ɤ(w)
(e)ɰe
we (w)ɤCʲ (w)o̜Cˠ oCʷ (w)ɤ(e̯) (w)e (w)ɤ o
(e)we oCʲ oCˠ o(e̯)

...And if it is indeed possible to have a long monophthong like in naaj with an angular F2 trajectory rather than a curved F2 trajectory, then that would seem to imply...it actually may be possible for [ɑ̯, ʌ̯, ɤ̯, ɰ] to surface after all, as potentially in [nʲɐɑ̯ɐtʲ]? Which could hypothetically allow for...

↓→ eCʲ eCˠ eCʷ ej ej(e) eɰ(e) ew ew(e)
Cʲe CʲeCʲ CʲɘCˠ CʲɤCʷ Cʲe Cʲɘ(ɤ̯) Cʲɤ Cʲɤ(w) Cʲo
Cˠe CˠɘCʲ CˠɤCˠ Cˠo̜Cʷ Cˠɘe̯ Cˠe Cˠɤ Cˠo̜(w) Cˠo
Cʷe CʷɤCʲ Cʷo̜Cˠ CʷoCʷ Cʷɤ(e̯) Cʷe Cʷo̜(ɤ̯) Cʷɤ Cʷo
je eCʲ (e̯)ɘCˠ (e̯)ɤCʷ e (e̯)ɘ(ɤ̯) (e̯)ɤ (e̯)ɤ(w) (e̯)o
(e)je eCˠ eCʷ e(ɤ̯) e(w)
ɰe (ɤ̯)ɘCʲ ɤCˠ (ɤ̯)o̜Cʷ (ɤ̯)ɘ(e̯) (ɤ̯)e ɤ (ɤ̯)o̜(w) (ɤ̯)o
(e)ɰe ɤCʲ ɤCʷ ɤ(e̯) ɤ(w)
we (w)ɤCʲ (w)o̜Cˠ oCʷ (w)ɤ(e̯) (w)e (w)o̜(ɤ̯) (w)ɤ o
(e)we oCʲ oCˠ o(e̯) o(ɤ̯)

...Though I can't be certain that's a safe logical leap to make. In any event, at this point this post is mostly just musing. And I'm still trying to experiment with writing a new phonetic IPA algorithm. - Gilgamesh (talk) 09:55, 31 January 2020 (UTC)


@Austronesier: I know it may be premature to fully form an opinion before you've read at least some of the MRG I've been consulting. However, at this point of ongoing algorithm refinement, in an effort in which it's not always easy to sit still and do nothing while code can be continually improved upon in small ways, I don't necessarily need a perfectly informed opinion at this time. Sometimes an opinion simply grounded in relatively more of the common sense I can sometimes lack, can be helpful. To offer ideas, to challenge ideas, or to check outright flaws in my judgment. So, do you have any tentative thoughts on this? Any of it—the phonetic reliability (or lack thereof) of the orthography, or IPA overhaul, or concepts towards a new phonetic output model, or even my triangle diagrams from the previous section, about the diphthong midpoints as a potential basis for a more accurate phonetic vowel transcription. - Gilgamesh (talk) 07:00, 2 February 2020 (UTC)


It seems [ɛ̯] rather than [æ̯] at the beginning of {ya} is correct after all. This is because of the reflexive preference for unstressed vowels, epenthetic vowels and semivowels in Marshallese to not be fully open vowels, but mid vowels instead. It seems the only circumstance where open unstressed vowels are considered truly stable are when they directly neighbor an implied velar approximant—though in that situation, it's actually more of an implied pharyngeal approximant because the vowel is so open. - Gilgamesh (talk) 23:35, 6 February 2020 (UTC)


I haven't commented here in a while, but I've been working on the algorithm almost every day. It was a far simpler matter when the orthography could be more or less relied upon to inform vowel realizations. That certainty is gone. And it doesn't help that I've been multitasking a lot more lately which has slowed my progress. - Gilgamesh (talk) 06:36, 25 February 2020 (UTC)


All right. Tentatively, the idea I'm working with involves 20 different vowel allophones, representing the midpoint of short diphthongs:

  • Vowels of the second column go between a palatalized consonant and an unrounded velarized consonant.
  • Vowels of the third column go between a palatalized consonant and a rounded velarized consonant, in addition to between two unrounded velarized consonants.
  • Vowels of the fourth column go between an unrounded velarized consonant and a rounded velarized consonant.
Front Central Back Semiround Round
Close i ɨ ɯ u
Mid-close e ɘ ɤ o
Mid ɛ̝ ə ʌ̝ ɔ̝͑ ɔ̝
Open æ ɐ ɑ ɒ̜ ɒ

For the sake of visual simplicity and to avoid adopting new unique vowel symbols, the second column can be replaced with centralized versions of the first column. This is especially helpful because [ə], though a true mid vowel symbol, looks more like [e] than like [ɛ], and because [ɐ] also looks too much like [e]. Additionally, the third row doesn't need its uptack if there are no truly mid-open vowels to oppose them.

Front Central Back Semiround Round
Close i ï ɯ u
Mid-close e ë ɤ o
Mid ɛ ɛ̈ ʌ ɔ̜ ɔ
Open æ æ̈ ɑ ɒ̜ ɒ

This works well between two full consonants, so that [rʲïk] is a good representation of dik.

But between vowels, a problem arises when you have sequences like [ɛ̯ɛ̈] or [ɛ̈ɛ̯]—they appear visually noisy. In these cases, it seems safe to coalesce them into [ɛ]. Similarly, sequences like [ɑ̯ɐ] or [ɑ̯æ̈] can be coalesced into [ɑ]. So:

  • aelōn̄ {hayȩlȩg} 'atoll' [ɑ̯ɐelʲɘŋ ɑ̯æ̈elʲëŋ ɑelʲëŋ]

For surfaced semiconsonants (following the Marshallese Reference Grammar, or MRG, by Bender et al., 2016, we're not calling them glides), no difference of height is recognized for rounded semiconsonants (only [w]), and only three of the four heights are recognized for front semiconsonants ([ɛ̯, e̯, j], not [*æ̯]). Semiconsonants surface if they cannot silently coalesce into both a neighboring vowel allophone's F2 and the max() of both neighboring vowels' F1.

Now where Choi (1992) comes in: Allophonically tense vowels, where a vowel's F2 lingers longer towards a semiconsonant nucleus. In the case of long monophthongs (CVGVC or CVGC), this is where the F2 forms a bow curve rather than two straight lines. I'm thinking of also applying this as half a bow curve to a phrase's final vowel (-CVG), so that the two vowels of a CVGCVG sequence harmonize in pronunciation, but this is partially supposition on my part and may not be defensible with sources. And yet for a word like jojo {jȩwjȩw}, the pronunciation [tʲoːdʲo] appears more elegant and to the point than [tʲoːdʲɤw].

Not entirely sure what to do with Āne-jaōeōe {yanȩy-jahȩyhȩy}, which the MRG brings up because the unusual onomatopoeic sequence ōeōe necessitated special consideration for its standardized spelling. Some of the sources generally comment that the velar semiconsonant ⟨h⟩ never surfaces, but some of those same sources also claim semiconsonants only surface at the beginnings and ends of words, and we already have so many examples where we know this isn't true. Ōeōe is already an unusual sequence. Could it have an equally unusual pronunciation like [ɤ̯eːɤ̯e]? Or, if you take the "no surfaced ⟨h⟩" rule to its fullest conclusion, must it be something else like [ɤe̯(ɤ).ɤe̯]? My previous assumption, that this had to be [ɤe̯ɤːe̯], was at least partially supposition on my part, and it may no longer be supportable.

Since the MRG has made so many helpful notes about Marshallese stress patterns, I will no longer treat the epenthetic vowel of C(V)GV sequences as a potential monophthong with the following full vowel, but instead treat it as a break, so that even if Mājeej {majȩyȩj} is [mʲæ(ˈ)dʲeːtʲ], Mājej {majyȩj} is [(ˈ)mʲædʲe̯(ˈ)etʲ] (enunciated [mʲætʲ etʲ]). Syllable stress is not that phonemic in Marshallese, but syllable anti-stress is, as a mora without a vowel (like the first consonant of a cluster) cannot take stress and thus forces syncope to occur.

And then there's ⟨j⟩. There are notes the MRG left about the nature of weejej (Marshallese ⟨j⟩-lisping) that make it more complicated than I originally thought. I knew that the phoneme /tʲ/ has many possible allophones, but generally assumed that they varied by the speaker, with a common general preference for either [zʲ] (especially among women) or [dʲ] (especially among men) between vowels. This is why supposition can be dangerous: The MRG clarified that this varies not only by person, but by word, and increases in frequency the more bilingual a speaker is. It's considered a kind of lisping because the pronunciation of ⟨j⟩ becomes irregularly articulated and voiced on a word-by-word basis, effectively becoming multiple different phonemes within one speaker's phonology to reflect the different original sounds of a loanword's origin in languages like English or Japanese; furthermore, weejej can also creep into words of native Marshallese origin. But the united phonemic model for Marshallese still has only one phoneme for ⟨j⟩, and it should have no more unique phonetic symbols than any of the other obstruents. For simplicity, I'm thinking [tʲ] for voiceless and [dʲ] for voiced, harmonizing with its velarized counterpart ⟨t⟩ [tˠ, dˠ].

That leaves allophonic voicing. Up to this point, there's been a common wisdom expressed in some of the sources that obstruents at the beginning or end of a phrase are voiceless and that obstruents between vowels are voiced (or at least partially so). The MRG doesn't distinguish initial and medial voicing. If anything, it instead draws a contrast between obstruents that precede vowels and obstruents at the end of a phrase: The former are said to be more similar to English voiced consonants than to English voiceless consonants, and it's certainly true that Marshallese obstruents are at least not usually aspirated in any circumstance. But only final obstruents are said to be fully voiceless, and unreleased. This raises a question: Is a word like babbūb {babbib} 'butterfly; moth' better expressed as [pˠɑppˠɯpˠ] or as [bˠɑpbˠɯpˠ] (implicitly [b̥ˠɑpˠb̥ˠɯpˠ])? Or maybe an intermediate approach like [bˠɑppˠɯpˠ] is more appropriate? - Gilgamesh (talk) 09:31, 25 February 2020 (UTC)


All right, I've been giving this some more thought. As much as I didn't want to introduce new vowel symbols, the central vowel symbols (second of five columns) may be acceptable if they more closely resemble their front counterparts in the first column. I considered this because [æ̈ ɛ̈ ë ï], after further consideration, look artificial as central vowels. [a ɜ ɘ ɨ] still bear a resemblance to their front counterparts and are central, with [a] being implicitly [ä]. I will experiment and see if this improves readability without appearing artificial.

Front Central Back Semiround Round
Close i ɨ ɯ u
Mid-close e ɘ ɤ o
Mid ɛ ɜ ʌ ɔ̜ ɔ
Open æ a ɑ ɒ̜ ɒ

Some notes on specific symbols:

  • Though the sound that [ɜ] is assigned to is actually a mid central vowel [ə], the selection of [ɜ] harmonizes both with the appearance of [ɛ] and with the selection of mid-open symbols for the third row as a whole, despite them all being mid. I considered [e̞ ə ɤ̞ o̜̞ o̞] for that entire row instead, but why make this more complicated than it has to be? Longstanding convention, pre-dating the official adoption of a full set of central vowel symbols, allows [ɜ] to stand in for [ə] if it is a stressed vowel, which this Marshallese vowel allophone can be, given that it is a short diphthong's midpoint.
  • The symbol [a] canonically represents the open front unrounded vowel, but longstanding convention allows it to represent the open central unrounded vowel (implicitly [ä]), to the point where a proposed separate dedicated vowel symbol for this, [ᴀ], has been repeatedly rejected for addition in the IPA, most recently in 2012. As the open central unrounded vowel's article explains, if separate symbols are needed for front, central and back open unrounded vowels, it is more typical to use [æ, a, ɑ] for that contrast. I had considered [ɐ] for the central symbol instead on account of it having the same canonical height as the symbol [æ], but [ɐ] still has its own readability issues, and Marshallese doesn't have a specifically near-open or fully open vowel phoneme anyway, just an open one, whose allophones' openness can vary between fully open and [ɛ̞, ɔ̞].

- Gilgamesh (talk) 03:09, 2 March 2020 (UTC)

@Gilgamesh: That's quite legible, actually.
A couple points. ⟨ə, ɐ⟩ were originally intended to be reduced vowels, and that's still how they're often used even if the IPA no longer specifies them as such. There are still telltale signs, such as not being defined for rounding. So IMO it's good that you've replaced them. Also, a ⟨ɘ, ə⟩ contrast would be difficult to read.
[a] is a central vowel, not front. The IPA still claims it's front, but that's only because they persist in defining vowels with that obsolete trapezoid. The vowel trapezoid is defined by place of articulation, but no-one defines vowels that way anymore. It's all by formant, and when mapping by formants, the prototype [a] is a central vowel, and lower than [æ, ɑ]. (Which for all I know is the case for Marshallese as well.)
So I don't think the changes you're defending above need much justification. — kwami (talk) 12:31, 10 March 2020 (UTC)
Thank you, I appreciate your review. I'm still working on this, on a much more intermittent level than before, because other things have been keeping me busy. But I still really want to continue working on this. - Gilgamesh (talk) 23:26, 11 March 2020 (UTC)

With life settling into the new normal, I've been trying to work on wiktionary:Module:mh-pronunc again, with a generous helping of notes from the Marshallese Reference Grammar (2016) and Dr. Bender's other publications over the decades to help improve it. Some points that have become clear:

  • In Bender (1968), he said that he could not personally discern a surfacing of {h}, but that his native Arabic-speaking colleague at the time insisted that it was a clear equivalent to ayin. In other words, a pharyngeal approximant, which English-speaking ears can't always easily recognize. While I've not seen fit to put [ʕ] or [ʕ̞] for every instance of the semiconsonant in phonetic transcription, I tried tweaking the semiconsonant F1 algorithm so that it surfaces as the max() of neighboring vowels if it is {y} or {w}, but the min() of neighboring vowels if it is {h}. This incidentally solves an oddity seen in modern Marshallese orthography: {yihakʷayal} 'a quarrel' is spelt iakwāāl (not the otherwise expected *iūakwāāl), and {wetewbahiy} 'a motorbike' is spelt otobai (not the otherwise expected *otobaūi), and these are both perfectly logical if the {h} in these words is surfacing as [ɑ̯ ~ ʕ̞] (because of the neighboring {a}) rather than [ɯ̯ ~ ɰ] (because of the neighboring {i}). While {h} demonstrably does become [ɯ̯ ~ ɰ] if it only neighbors {i}, it seems like a clear abstract understanding of the phoneme (in phonemic transcription) is that it is more of a /ʕ/ than a /ɰ/, and yet still grouped as a dorsal consonant because of its allophones [ɑ̯~ʕ̞, ʌ̯~ʁ̞~ɤ̯, ɯ̯~ɰ] depending on the semiconsonant's surface F1. There seems to be one exception I can see in the orthography: The island named Āne-jaōeōe {yanȩy-jahȩyhȩy}, where under this modified understanding we should expect a spelling of Āne-jaeōe instead. (OR warning.) But this can be explained semantically by the name of the island specifically referencing (and prefixing) the onomatopoeia ōeōe {hȩyhȩy}, as the island's name means "the island that makes the sound ōeōe." When prefixed by ja-, the term may actually take on a surface pronunciation closer to jaeōe than jaōeōe, but the morphemes are spelt separately to preserve the word's etymology. (End OR.)
  • In the MRG's section on excrescent (epenthetic) vowels (p. 74-78), it finally resolves that there is no audible difference between full vowels and epenthetic vowels. Though the latter are omitted in enunciation, they are always present in uninterrupted speech, their F1 is predictable, but they are not perceived to exist by their speakers—it's just a quirk of the language that is deeply ingrained in its native speakers. As such, epenthetic vowels' surface phonetic transcription is to be treated the same way as full vowels. Not that there is never a way to tell these apart, because:
  • In the MRG's section on (syllable) stress (p. 100-102) and the following section on analyzing words according to their morae (p. 102-110), it explains that Marshallese syllable stress is actually very predictable. Marshallese evolved from an earlier Micronesian language phase whose words had only the syllable structures CVCV, CVCVCV, etc., not unlike Gilbertese or Polynesian languages, and in this ancestral language a word's stress was always on the penultimate (second-to-last) syllable with secondary stress on alternating syllables before that: /ˈCVCV, CVˈCVCV, ˌCVCVˈCVCV/, etc. But somewhere in the evolution of the Marshallese language, most of the unstressed vowels became elided, and phoneme structures shifted to /ˈCVC, CVˈCVC, ˌCVCˈCVC/, etc. And when an epenthetic vowel is inserted between two consonants of a consonant cluster, it is not to be understood as a new vowel inserted where none existed, but as the resurfacing of a vowel that would otherwise be silent. Marshallese is still a mora-timed language, and the MRG says that consonant-only morae without a vowel have the same rhythm as morae with vowels, with the consonant held for the same amount of time in memory of the lost vowel: /ˈCV.C̩, CVˈCV.C̩, ˌCV.C̩ˈCV.C̩/. Only, because the final mora is unreleased or only minimally released in pausa, it's more like /ˈCV(C), CVˈCV(C), ˌCV.C̩ˈCV(C)/, meaning the final stress is on the (truly) final syllable, if still on the penultimate mora. Furthermore, it is no longer strictly the case that the final stressed syllable gets a word's primary stress, as primary stress can now also fall on the penultimate stressed syllable (the fourth or fifth to last mora), like /ˈCV.C̩ˌCV(C)/, though no further back in a word than that. (For the sake of simplicity of transcription, if a consonant-only mora cannot receive syllable stress, it makes no sense to transcribe it as its own syllable, so /ˈCVCˈCV(C)/ is probably fine.)
    But even then, it's not 100% that simple, because the Marshallese language prohibits consonant-only morae from ever receiving syllable stress. Which syllables have stress is still predictable from the end of the word, but in sequences of /ˈCVC/, /ˈCVCV/ or /ˈCVCCV/, with the third pattern creating a syncope where stressed syllables are three morae apart instead of two. While this syncope is allowed in the language, it is also considered unstable in long-term development, with /ˈCVCCVˈCVC/ words in particular wanting to realign as /CVˈCVCˈCVC/ in a syllable structure that still favors stressed syllables being two morae apart. This explains why the inflected verb eaktuwe {yaktiwey} 'to unload (intransitive or transitive)' has the alternative synonymous (and more stable) form ākūtwe {yakitwey}.
    Additionally, where /ˈCVCVˈCVC/ forms have a second vowel that can sound identical to equivalent /ˈCVCˈCVC/ forms with an epenthetic vowel in the same position, the language is littered with doublet stems reflecting this: bōļāāk {beļayak} 'a flag' has the synonymous homophonous doublet bōļeak {beļyak}, and in the opposite semantic direction, the atoll Jālwōj {jalwȩj} 'Jaluit' has the synonymous homophonous doublet Jālooj {jalȩwȩj}. And though these doublets are homophonous in isolation, they are not homophonous when suffixed, because /ˈCVCVˈCVC + -VC/ has the rhythm /CVˈCVCVˈC‿VC/, and /ˈCVCˈCVC + -VC/ has the rhythm /ˈCVCCVˈC‿VC/. And these example words bring me to another point:
  • Marshallese long vowels are indeed purely semantic and not in any way phonemic, and a suffixed word's traveling stress accent can actually turn /ˈCV(G)V/ into /CVˈ(G)V/ and vice versa, provided both are full vowels rather than epenthetic vowels. Nevertheless, Marshallese vowels transition smoothly even across syllable boundaries, so words like piik {piyik} 'pig' have a stress pattern of {piˈyik}, but a Western ear would be hard-pressed to detect two syllables (hearing only [piːk]), which is part of why the older Marshallese orthography often wrote these two-vowel sequences as just one vowel: bik. Because of this smoothness, I've found myself tempted to try to present words like this as having the stress pattern {ˈpiyik}, but the MRG makes it clear that that is not how Marshallese syllable stress works.
  • The surfacing of semiconsonants becomes proportionally more pronounced the larger the F2 leap is between its closest neighboring consonants. In particular, {y} surfaces the hardest if the nearest neighboring consonants are rounded, and {w} surfaces the hardest if the nearest neighboring consonants are palatalized. As such, though it can be semantically convenient to phonetically analyze words like jojo {jȩwjȩw} 'a flying fish' as [tʲoːdʲo], the true pronunciation is closer to [ˈtʲɤwɤˈdʲɤw]. When the orthography was standardized, it was deemed simpler to spell {jȩw} as jo rather than *jōw. And yet its exact phonetic reverse, {wȩj} 'beauty; you', is to be regularly written wōj rather than oj, unless it forms the second half of a semantic long vowel as in Jālooj {jalȩwȩj}, despite even Jālooj actually being pronounced more like *Jālōwōj.
    The modern orthography can be a thing of beauty, but in many respects its rules were always knowingly arbitrary. And yet while there's an appeal in keeping orthography rules simple, it seems more difficult to justify in phonetic IPA transcription. See, even though we have an interest in avoiding unreadably convoluted phonetic transcriptions that only confuse readers, many of the orthographically-informed phonetic transcriptions we've recently been using are just plain wrong and unphonetic to the language. So even though the orthography settled on jo which suggests [tʲo], it is only most correct to go with [ˈtʲɤw]. There is still room for compromise on the issue, as the transitional phonetic symbols I proposed in previous comments [a ɜ ɘ ɨ ɒ̜ ɔ̜ u̜] may suffice if neighboring two full consonants, but they hinder readability of phonetic transcriptions if you write them next to semivowels of the same F1 height. For example, em̧m̧an {yem̧m̧an} 'it is good' may be fairly accurately represented as [ˈɛ̯ɜmˈmˠanʲ] with only the transitional vowels I mentioned, but it's undeniable that [ˈɛmˈmˠanʲ] is more readable without introducing any ambiguity. And with the word kweet {kʷȩyȩt} 'an octopus', [kʷeˈetˠ] as per the orthography may be too radically simplistic, and [kʷɤˈe̯ɘtˠ] may be too unreadably convoluted, but [kʷɤˈetˠ] may represent an appropriate compromise.

- Gilgamesh (talk) 04:08, 30 April 2020 (UTC)

@Austronesier: I know we haven't discussed this topic in a while, and then all the bumpy events started happening. I'm curious—were you ever finally able to obtain a copy of Marshallese Reference Grammar? - Gilgamesh (talk) 10:00, 3 May 2020 (UTC)

@Gilgamesh: Yes, I have bought it as an ebook, but haven't read much of it yet. I plan to extend the Grammar section based on it. Sorry I haven't caught up yet with your latest ideas, but you know that I tend to do many different things at the same time. –Austronesier (talk) 16:08, 3 May 2020 (UTC)
@Austronesier: Yes, I'm well aware of that. And I do rather wish I had an ebook copy in addition to my physical copy. I'm glad I bought my copy, but a physical book can be unwieldy to consult when I have no desk and am using a laptop. - Gilgamesh (talk) 16:28, 3 May 2020 (UTC)

There's one thing I don't quite understand about vowel and semiconsonant F2 in Marshallese. Because {y} is front and unrounded, {h} is back and unrounded, and {w} is back and rounded, one would think there would be a triangular relationship between these secondary articulations:

    w
  / |
j - ɰ

I mean, going by the IPA vowel trapezium, it seems like a reasonable analysis. And of course, concerning the question of diphthong midpoints, this would seem logical, right?

          u
        / |
    (ʉ̜)  (u̜)
  /       |
i - (ɨ) - ɯ

The problem is, in Marshallese, the midpoint between [i] and [u] is not [ʉ̜]—the MRG makes it crystal clear that it's actually [ɯ], suggesting a different relationship between the vowel allophones.

          u
          |
         (u̜)
          |
i - (ɨ) - ɯ

And since this is effectively one-dimensional now, it can be represented instead like this.

  • i - (ɨ) - ɯ - (u̜) - u

And so, a relationship between diphthong allophone and midpoint allophone becomes clear:

i i͡ɯ i͡u ɯ͡i ɯ ɯ͡u u͡i u͡ɯ u
i i͡ɨ͡ɯ i͡ɯ͡u ɯ͡ɨ͡i ɯ ɯ͡u̜͡u u͡ɯ͡i u͡u̜͡ɯ u
ʲ‿i‿ʲ ʲ‿ɨ‿ˠ ʲ‿ɯ‿ʷ ˠ‿ɨ‿ʲ ˠ‿ɯ‿ˠ ˠ‿u̜‿ʷ ʷ‿ɯ‿ʲ ʷ‿u̜‿ˠ ʷ‿u‿ʷ

This much is known, and I've discussed it previously on this talk page in now-archived sections, even if I improvised on transitional vowel symbols like [ɨ] and [u̜] using (as close as possible) canonical IPA symbols for central and semi-rounded vowels. What I didn't quite get is how this forms any kind of natural one-dimensional F2 scale instead of the triangle or right angle I assumed before, as it would seem that Marshallese [u] would necessarily have to be even further back than [ɯ] despite the IPA vowel trapezium indicating them as positional equivalents, differing only in rounding. But then more recently I saw this:
 
It is an alternative arrangement of IPA vowels more closely aligned by formant, and it indicates that [ɯ] and [u] have a different F2 after all. If [ɯ] is back, then [u] is somehow further back than that. I admit I'm not 100% sure what to make of all this or how to describe it in relation to Marshallese phonology. Nevertheless, in my most recent drafts of the module's phonetic algorithm, I've been calculating vowel F2 as the raw average of the F2 of their nearest consonants: 0.5 * (left F2 + right F2). And it works. It doesn't always produce the prettiest phonetic results when converted to IPA...

  • jebkwanwūjo̧ {jebkʷanwijaw} /tʲɛpˠkʷænʲwitʲæw/ [ˌtʲɜbˠɔ͑ˈɡʷɑnʲɯwɯˌdʲɑw] 'coconut oil used for frying'

...but it still captures the midpoints in a way accurate in principle to how the MRG describes them. - Gilgamesh (talk) 14:04, 5 May 2020 (UTC)
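The raw-average rule above can be sketched in a few lines. This is only an illustration in Python (the module itself is Scribunto/Lua). The numeric F2 ranks 0, 1 and 2, the table `VOWELS` and the function name `vowel_midpoint` are assumptions made for this example, not names taken from the module. The semiround symbols follow the notation of the example above ([ɔ͑]), assumed here to use the combining left half ring above.

```python
# Assumed F2 ranks for the three secondary articulations:
# palatalized (front), velarized (back), rounded (round).
F2 = {"ʲ": 0.0, "ˠ": 1.0, "ʷ": 2.0}

SEMIROUND = "\u0351"  # combining left half ring above, as in [ɔ͑]

# One row per Bender vowel; columns are the F2 averages 0, 0.5, 1, 1.5, 2.
VOWELS = {
    "i": ["i", "ɨ", "ɯ", "u" + SEMIROUND, "u"],
    "ȩ": ["e", "ɘ", "ɤ", "o" + SEMIROUND, "o"],
    "e": ["ɛ", "ɜ", "ʌ", "ɔ" + SEMIROUND, "ɔ"],
    "a": ["æ", "a", "ɑ", "ɒ" + SEMIROUND, "ɒ"],
}


def vowel_midpoint(vowel, left, right):
    """Return the allophone at the raw average of the neighbors' F2."""
    average = 0.5 * (F2[left] + F2[right])
    return VOWELS[vowel][int(average * 2)]


# {i} between a palatalized and a rounded consonant lands on the back vowel.
print(vowel_midpoint("i", "ʲ", "ʷ"))  # ɯ
# {a} between two velarized consonants.
print(vowel_midpoint("a", "ˠ", "ˠ"))  # ɑ
```

Because the average is always a multiple of 0.5, doubling it gives a clean column index, which is why a five-column table is enough.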


So, I just now had a thought. If we can actually take this notion of a one-dimensional F2 scale seriously under the assumption that [u] is indeed further back than [ɯ], then it may be possible to use fewer symbols for all midpoints, and simplify the second and fourth columns:

Front Central Back Semiround Round
Close i ï ɯ ü u
Mid-close e ë ɤ ö o
Mid ɛ ɛ̈ ʌ ɔ̈ ɔ
Open æ a ɑ ɒ̈ ɒ

Very orthogonal and easier to memorize. The only reason I'm thinking [a] can still be used is that it's already an intuitive IPA symbol, simpler than [æ̈], though [ä] may not be out of the question. - Gilgamesh (talk) 16:00, 5 May 2020 (UTC)

Okay, that was...unsightlier than I expected. Too many diaereses, too many dots. Not a visual improvement. - Gilgamesh (talk) 16:09, 5 May 2020 (UTC)


All right, so currently, in the module's sandbox draft, the vowel and semivowel (surfaced semiconsonant) phonetic symbols look like this.

Vowels   Semivowels
Front Central Back Semiround Round Front Back Round
Close i ɨ ɯ u j ɰ w
Mid-close e ɘ ɤ o ɤ̯
Mid ɛ ɜ ʌ ɔ͑ ɔ ɛ̯ ʌ̯
Open æ a ɑ ɒ͑ ɒ ɑ̯

Some notes, admittedly sprinkled with OR:

  • For [ɒ͑ ɔ͑ u͑], I put the less-rounded diacritic above the vowel instead of the usual below, so that it doesn't risk being misread for a non-syllabic diacritic used in [ɛ̯ ɑ̯ ʌ̯ ɤ̯].
  • In my current algorithm draft, {y} and {w} between two vowels assumes the max() of the F1 articulations of the vowels, but for reasons I already mentioned previously, {h} is assuming the min() of the F1 articulations of the vowels. And if a semiconsonant's surface articulation is the equivalent approximant of any of its neighboring vowels, my algorithm unsurfaces (elides) it, so [eji] becomes [ei], etc.
  • It is not 100% clear that there is no [æ̯], but I use [ɛ̯] for both the mid and open surfaced front semiconsonant because the orthography uses ⟨e⟩ for both, though in some places ⟨ea⟩ is in practice interchangeable with ⟨ā⟩ for [ɛ̯a(Cˠ)] (eakto vs. ākto). I know that the orthography is, in many ways, arbitrary by convention and [æ̯a(Cˠ)] actually could be more correct in these positions, but I have one reason to believe it may be [ɛ̯a(Cˠ)]: Ean̄ and iōn̄ are doublet synonyms meaning 'north.' Phonemically they are {yag} and {yi'yȩg} respectively, but I noticed the pattern that iōn̄ [jɘŋ] has a higher semivowel transitioning to a full vowel one step lower, and ean̄ would follow the same pattern but with a starting point two steps lower if the pronunciation were [ɛ̯aŋ]. In any event, even if [æ̯] and [ɛ̯] exist separately, they never phonetically contrast as they are both allophones of plain {y}.
  • Technically, [ɛ̯] and [e̯] do not phonetically contrast either, and I could conceivably use [e̯] instead of both [æ̯] and [ɛ̯], but I separate [ɛ̯] and [e̯] in appropriate contexts in respect to Marshallese mid-vowel harmony. So not only do [ɛ̯] and [e̯] not contrast, but the semiconsonant-separated two-vowel sequences [ɛ.ɛ, e.e] cannot contrast either, and the sequences [ɛ.e, e.ɛ] are prohibited by phonotactics anyway because of that mid-vowel harmony.
  • [e̯] and [j] can phonetically contrast because [e̯] can only be an allophone of {y}, while [j] can be an allophone of either {y} (neighboring {i}) or {yi'y} (not neighboring {i}).
  • There is no phonetic contrast between [*ɒ̯, *ɔ̯, *o̯, *u̯] as allophones of {w}, so [w] is used for all four.
  • I'm using [ɑ̯, ʌ̯, ɤ̯, ɰ] as allophones of {h}. None of them phonetically contrast, just like the allophones of {w}, and I could conceivably use one symbol for all of them. But because of {h}'s apparent exotic min() F1 behavior, and the uncertainty of whether the semiconsonant can best be phonemically analyzed as /ʕ/ or /ɰ/, I've been keeping separate allophone symbols for now.
    • I've been tempted to replace [ɑ̯] with [ʕ] (pharyngeal approximant), and [ʌ̯, ɤ̯] with [ʁ] (uvular approximant), because these consonants have a semivowel relationship with those vowels in the same way [j, ɰ, w] have a semivowel relationship with [i, ɯ, u]. But what hinders [ʕ, ʁ] notation is the same thing that apparently prevented Dr. Bender from being able to perceive {h} surfacing even as his Arabic-speaking colleague could—these consonants are unknown to English and Japanese, the most recent influential foreign languages in contact with Marshallese, and English or Japanese learners of Marshallese or Marshallese speakers familiar with English or Japanese will not intuitively grasp the semivowel relationship between [ʕ, ʁ] and [ɑ̯, ʌ̯~ɤ̯]. Why does this matter? Because one of the biggest recurring problems with Marshallese IPA on Wikimedia wikis—and the reason I've spent so much time and effort trying to improve it—is that it can be very hard to read. Also, the symbol /ɰ/ is more widely used in linguistics papers concerning Marshallese, whereas [ʕ, ʁ] are not.
    • There's also the distinct possibility that even if {h} was perceived to surface by an Arabic-speaking colleague of an English-speaking linguist back in 1968, it may no longer do so now, as Marshallese and its phonotactics continue to be influenced by generations of speakers learning and becoming fluent in English. But that's speculation outside the scope of this project. Either Bender's colleague was right and {h} is primarily ayin, or {h} can be considered not to surface at all. I'm leaning towards it being able to surface, if at least because semiconsonants routinely surface when isolated by vowels whose F2 differ from them, and surface even more strongly the greater the F2 transition is.

So, any thoughts? Different ideas of how these vowel and semivowel symbols should be displayed?

Continuing on. At this point I really have no idea how to represent long monophthongs with an F2 bow curve as described in Choi (1992), or how to reliably detect long monophthongs in a way that matches their occurrence in the spoken language. Since I started redrafting the phonetic algorithm more recently, I've mostly been trying to avoid making any assumptions in regard to that topic. Making no such assumptions can result in some unlovely-looking sequences:

Spelling   Phonemic   Phonetic   Semantic
ʲVːʲ ʲVːˠ ʲVːʷ ˠVːʲ ˠVːˠ ˠVːʷ ʷVːʲ ʷVːˠ ʷVːʷ
ii {iyi} ɨi ɨjɨ ɨjɯ ɯi ɯjɨ ɯjɯ
ee {ȩyȩ} ɘe ɘe̯ɘ ɘe̯ɤ ɤe ɤe̯ɘ ɤe̯ɤ
{eye} ɛː ɛɜ ɛʌ ɜɛ ɜɛ̯ɜ ɜɛ̯ʌ ʌɛ ʌɛ̯ɜ ʌɛ̯ʌ ɛː
āā {aya} æː æa æɑ aɛ̯a aɛ̯ɑ ɑæ ɑɛ̯a ɑɛ̯ɑ æː
ūū {ihi} ɨɰɨ ɨɯ ɨɰu͑ ɯɨ ɯː ɯu͑ u͑ɰɨ u͑ɯ u͑ɰu͑ ɯː
ōō {ȩhȩ} ɘɤ̯ɘ ɘɤ ɘɤ̯o͑ ɤɘ ɤː ɤo͑ o͑ɤ̯ɘ o͑ɤ o͑ɤ̯o͑ ɤː
{ehe} ɜʌ̯ɜ ɜʌ ɜʌ̯ɔ͑ ʌɜ ʌː ʌɔ͑ ɔ͑ʌ̯ɜ ɔ͑ʌ ɔ͑ʌ̯ɔ͑ ʌː
aa {aha} aɑ̯a aɑ̯ɒ͑ ɑa ɑː ɑɒ͑ ɒ͑ɑ̯a ɒ͑ɑ ɒ͑ɑ̯ɒ͑ ɑː
uu {iwi} ɯwɯ ɯwu͑ ɯu u͑wɯ u͑wu͑ u͑u uu͑
oo {ȩwȩ} ɤwɤ ɤwo͑ ɤo o͑wɤ o͑wo͑ o͑o oo͑
{ewe} ʌwʌ ʌwɔ͑ ʌɔ ɔ͑wʌ ɔ͑wɔ͑ ɔ͑ɔ ɔʌ ɔɔ͑ ɔː ɔː
o̧o̧ {awa} ɑwɑ ɑwɒ͑ ɑɒ ɒ͑wɑ ɒ͑wɒ͑ ɒ͑ɒ ɒɑ ɒɒ͑ ɒː ɒː

But it's proven not always as simple as replacing each phonetic sequence with the semantic geminated vowel. As I mentioned before, there are doublets with conflicting phonemic analyses resulting in alternate spellings and non-homophonous inflected forms:

Noun   Noun construct form   Meaning
Word Phonemes Phonetic Semantic Word Phonemes Phonetic Semantic
baļuun {baļiwin} [ˈpˠɑlˠu͑ˌwɯnʲ] [pˠɑlˠuːnʲ] baļuunin {baļiwinin} [pˠɑˈlˠu͑wɯˌnʲinʲ] [pˠɑlˠuːnʲ-inʲ] 'airplane'
baļwūn {baļwin} [pˠɑlˠ.wɯnʲ] baļwūnin {baļwinin} [ˈpˠɑlˠu͑wɯˌnʲinʲ] [pˠɑlˠ.wɯnʲ-inʲ]
bōļāāk {beļayak} [ˈpˠʌlˠaˌɛ̯ak] [pˠʌlˠæːk] bōļāākin {beļayakin} [pˠʌˈlˠaɛ̯aˌɡinʲ] [pˠʌlˠæːk-inʲ] 'flag'
bōļeak {beļyak} [pˠʌlˠ.ɛ̯ak] bōļeakin {beļyakin} [ˈpˠʌlˠaɛ̯aˌɡinʲ] [pˠʌlˠ.ɛ̯ak-inʲ]

It's fairly obvious that, of these pairs of words, the first form with the semantic long vowel and CVCVCVC structure more reflects how it entered the language (from the English words "balloon" and "flag," respectively), but the second form was homophonously clipped into a CVCCVC structure that is far more common in native Marshallese words. But if the reanalyzed CVCCVC homophones are still pronounced identically to CVCVCVC forms with the semantic long vowel, why not just use a geminated vowel [uː, æː] for all candidate sequences anyway? Because there are situations where this would be inappropriate.

  • Again, long vowels in Marshallese are purely semantic and do not exist on the phonemic level, and only occasionally exist on the phonetic level in sequences of identical vowels with no variation in secondary articulation.
  • The homophones are regularly homophones in normal speech, but they are not homophones when enunciated—baļwūn goes from [ˈpˠɑlˠu͑ˌwɯnʲ] in normal speech to [ˈpˠɑlˠ ˈwɯnʲ] when enunciated, and Marshallese Wiktionary entries now contain multiple pronunciations per word where their normal and enunciated pronunciations differ, reflecting the epenthetic vowels in one form and prosodic breaks at the same locations in the other form.
  • An aggressive long vowel detection algorithm can collide awkwardly with reduplicated words like wajwaj [ˈwɑdʲɑˌwɑtʲ] 'to wear a watch.' There is no word wajo̧o̧j {wajawaj} in Marshallese, but if there were, it would normally be a homophone of wajwaj, and it would be strange to give a hyper-semantic phonetic form [ˈwɑdʲɒˌːtʲ] to a word that is normally a homophone of [ˈwɑdʲɑˌwɑtʲ].

It just seems more consistent and less problematic overall to not try to auto-detect and force semantic long vowels into phonetic transcription. Transcriptions like [ˈlʲɑwɑˌdʲɘtˠ] for lo̧jet {lawjȩt} 'ocean' may appear unlovely and potentially more confusing than [ˈlʲɒːˌdʲetˠ], but the former is relatively more authentic to the language's phonetic model. Also, in some cases, the more complex phonetic form preserves artifacts in loanwords that have since been given even more simplified standardized spellings: bōkāro {bekayrew} '(interjection) stupid' derived from Japanese 馬鹿野郎 (baka yarō) and the phonetic form [pˠʌˈɡaɛ̯aˌrˠɔ͑w] actually shows the residual yarō more clearly as [ɛ̯arˠɔ͑w] than the transcription [pˠʌˈɡæːˌrˠɔ] with its strict application of semantic vowels. - Gilgamesh (talk) 07:27, 7 May 2020 (UTC)

@Gilgamesh: Regarding long vowels, I have an (OR) idea going in a completely different direction. I'll put it in my sandbox when it's ripe and ping you. –Austronesier (talk) 17:40, 7 May 2020 (UTC)
@Austronesier: All right, I look forward to reading it. I also still have other potential ideas to experiment with. - Gilgamesh (talk) 17:51, 7 May 2020 (UTC)

Another idea I've had for long monophthongs...

Spelling   Phonemic   Phonetic   Semantic
ʲVːʲ ʲVːˠ ʲVːʷ ˠVːʲ ˠVːˠ ˠVːʷ ʷVːʲ ʷVːˠ ʷVːʷ
ii {iyi} iːɰ iːw ɰiː ɰiːɰ ɰiːw wiː wiːɰ wiːw
ee {ȩyȩ} eːɤ̯ eːw ɤ̯eː ɤ̯eːɤ̯ ɤ̯eːw weː weːɤ̯ weːw
{eye} ɛː ɛːʌ̯ ɛːw ʌ̯ɛː ʌ̯ɛːʌ̯ ʌ̯ɛːw wɛː wɛːʌ̯ wɛːw ɛː
āā {aya} æː æːɑ̯ æːw ɑ̯æː ɑ̯æːɑ̯ ɑ̯æːw wæː wæːɑ̯ wæːw æː
ūū {ihi} jɯːj jɯː jɯːw ɯːj ɯː ɯːw wɯːj wɯː wɯːw ɯː
ōō {ȩhȩ} e̯ɤːe̯ e̯ɤː e̯ɤːw ɤːe̯ ɤː ɤːw wɤːe̯ wɤː wɤːw ɤː
{ehe} ɛ̯ʌːɛ̯ ɛ̯ʌː ɛ̯ʌːw ʌːɛ̯ ʌː ʌːw wʌːɛ̯ wʌː wʌːw ʌː
aa {aha} ɛ̯ɑːɛ̯ ɛ̯ɑː ɛ̯ɑːw ɑːɛ̯ ɑː ɑːw wɑːɛ̯ wɑː wɑːw ɑː
uu {iwi} juːj juːɰ juː ɰuːj ɰuːɰ ɰuː uːj uːɰ
oo {ȩwȩ} e̯oːe̯ e̯oːɤ̯ e̯oː ɤ̯oːe̯ ɤ̯oːɤ̯ ɤ̯oː oːe̯ oːɤ̯
{ewe} ɛ̯ɔːɛ̯ ɛ̯ɔːʌ̯ ɛ̯ɔː ʌ̯ɔːɛ̯ ʌ̯ɔːʌ̯ ʌ̯ɔː ɔːɛ̯ ɔːʌ̯ ɔː ɔː
o̧o̧ {awa} ɛ̯ɒːɛ̯ ɛ̯ɒːɑ̯ ɛ̯ɒː ɑ̯ɒːɛ̯ ɑ̯ɒːɑ̯ ɑ̯ɒː ɒːɛ̯ ɒːɑ̯ ɒː ɒː

Of course, there are still problems with this, like Jālwōj ({jalwȩj} → {jálȩwȩ́j}) and Jālooj ({jalȩwȩj} → {jálȩwȩ́j}) still being homophones in normal speech and their pronunciation still clearly being something close to [ˈtʲælʲɤˌwɤtʲ], not [ˈtʲælʲe̯oˌːe̯tʲ].
Another problem that remains is where in a word to reliably algorithmically detect long monophthongs. For most words, detecting such vowels by the Bender phonemes alone is trivial. But there arise words where software detection of long monophthongs runs into potential pitfalls.

  • Āne-jaōeōe {yanȩy-jahȩyhȩy} → {yanȩ́yȩjahȩ́yȩhȩ́y}
  • Ānewetak {yanȩyweytak} → {yanéyewéyeták}, 'Enewetak'
  • awa {hawáh}, 'hour'
  • awaan {háwahán}, 'hour of'
  • Awai {hawahyiy} → {hawáhayíy}, 'Hawaiʻi'
  • Ewerōk {yȩwȩyrȩk} → {yȩwȩ́yȩrȩ́k}
  • iakiuin {yi'yakíyiwín}, 'baseball of'
  • iiaeae {'yiyahyahyey} → {yíyiyáhayáhayéy}, 'rainbow color'
  • indeeo {yindȩyyew} → {yíndéyeyéw}, 'forever'
  • M̧aļoeļap {m̧aļewyeļap} → {m̧aļéweyeļáp}, 'Maloelap'
  • o̧waj {wawwaj} → {wáwawáj}, 'horse'
  • piaea {piyahyah} → {piyáhayáh}, 'fish roe'
  • pilawā {pilahway} → {piláhawáy}, 'bread'
  • piwūj {piywij} → {píyiwíj}, 'fuse'

Hopefully you should see the issues right off the bat:

  • Where multiple would-be long vowel sequences overlap, deciding which long vowel nucleus should take precedence. And if there's more than one neighboring each other, where their rhythm lies. This consideration is complicated by the fact that {aya, áya, ayá} can all form perfectly valid long vowel sequences.
  • The same old awa problem, where a would-be long vowel sequence appears to be cancelled, at least at the semantic level.

Intuitively, most of these cases are easy to solve. I've had several theories of how to programmatically resolve these conflicts in an algorithm, but some are OR enough that I'm not certain enough about their reliability or wisdom. Some of these theories include (but are not exhaustively limited to):

  1. A full vowel-semiconsonant-epenthetic vowel sequence always gravitates towards long monophthong status, and takes highest precedence among neighboring candidates. This completely solves Āne(e)-jaōe(e)ōe, Āne(e)we(e)tak, Awa(a)i, Ewe(e)rōk, i(i)ia(a)ea(a)e, inde(e)eo, M̧aļo(o)eļap, o̧(o̧)waj, pia(a)ea, pila(a)wā, pi(i)wūj right off the bat—almost all the examples I cited.
  2. A would-be long monophthong with a {y} or {w} nucleus may be cancelled if it neighbors {h} on either of its other sides. This potentially solves (h)awa(h), (h)awa(h)an.

The remaining example is iakiuin. Obviously, its uninflected form iakiu is an easy example to solve as it only has the one {iyi} candidate sequence for an algorithmically detected [ˈjaɡiˌːw]. But iakiuin has overlapping {iyiwi}, and since rhythm doesn't really factor into what is treated like a long vowel or not, the outcomes [jaˈɡiːˌwɯnʲ] and [jaˈɡɨjuˌːnʲ] both seem equally possible. I cannot settle on either outcome without it being OR. So the most non-OR option left on the table has been...no special logic for long monophthongs whatsoever, which results in [jaˈɡɨjɯˌwɯnʲ] for iakiuin, and [ˈjaɡɨˌjɯw] for iakiu. - Gilgamesh (talk) 11:41, 8 May 2020 (UTC)

Cedillas

The article states that the cedillas should appear ‘unaltered’, which I take to mean like 𐒑, and that's how they appear on the stamps shown in the Marshallese-English Online Dictionary. But according to Comments on cedilla and comma below, Marshallese allows for style variation, and several different cedilla styles are shown. One of these is called an acute style, and this style, or something similar, is used for both the Romanian commas below and the Latvian cedillas in many fonts, such as Calibri, Consolas and DejaVu.

As a purely practical matter, the Marshallese characters shown in the article display atrociously for most readers:
ĻM̧ŅO̧ ļm̧ņo̧
This may not be true for you of course, but most readers will see commas below the L and M, a 𐒑-cedilla below the O and a 𐒑-cedilla that can't decide whether it wants to belong to the M or to the next letter.

Why not instead use a comma below, or Latvian cedilla if you will, in all cases?
ĻM̦ŅO̦ ļm̦ņo̦
I realise of course that this may not be technically correct in terms of the underlying code points and could possibly lead to similar issues with Marshallese proofing tools as using the Greek ο in an English wοrd, but most of our readers aren't Marshallese and for most readers this would look better. By the way, the ZWNJ solution has the same issue and doesn't seem to work right for M anyway.

Marshallese people writing in Marshallese are unlikely to copy-paste the characters from this article. They will type their text themselves and use fonts with special support for Marshallese. They will be able to take care of themselves. Our readers however don't benefit from the current state of affairs, which is why I think it needs to change. My proposal may of course not be the best solution and if you think there's a better way please share it. But it shouldn't be left as is. — Preceding unsigned comment added by 2A02:A457:9497:1:A05B:A2F3:89CB:59B9 (talk) 14:03, 6 April 2020 (UTC)

Alternative display conventions for Marshallese are often used as graphical workarounds in both digital text and some print text where the display of standard diacritics cannot be properly achieved. However, none of these workarounds are standardized, and polished Wikipedia articles in particular do not settle for ad hoc character encoding as a primary character display. These particular Marshallese letters use cedillas, so they are encoded with the corresponding Unicode characters specified to use cedillas, along with lang="mh" language encoding where possible, which is sufficient to display the characters correctly in certain newer Unicode fonts and text rendering engines. If the characters are encoded correctly but display problems persist on the client side, then it is the client's fonts and text rendering that are at fault and require fixes, not the Marshallese text itself. And as it stands, in many clients where the Marshallese text still renders improperly, the text is at least still legible. - Gilgamesh (talk) 11:09, 8 April 2020 (UTC)

A zero-width joiner should take care of it, but WP ignores it when rendering Unicode characters. We could address that. — kwami (talk) 21:30, 28 May 2020 (UTC)

IPA algorithm milestone

All right, I have some working concepts right now. But first, I'll describe the process step by step, starting with how these algorithm concepts are applied to phonemic forms after dialect considerations. This roughly describes the tasks that must be performed by the module's toPhoneticRemainder function.


Separate approaches for enunciated or non-enunciated mode. Enunciated words are spoken carefully, while non-enunciated words reflect usual organic speech rhythm.

At this step, if a phonetic pronunciation is being generated in enunciated mode, treat each consonant cluster as a prosodic break:

  • /mˠæɰtʲɛlˠ → mˠæɰ tʲɛlˠ/ (final forms [ˈmˠɑːˌdʲɜlˠ] vs. [ˈmˠɑ ˈtʲɜlˠ])

Pretty simple so far.
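As a rough sketch of that step, assuming the word has already been segmented into a list of phoneme strings (the segmentation itself, and special sequences such as /ji̯j/, are not handled here), the split can be written in Python as follows. The names `PHONEMIC_VOWELS` and `split_at_clusters` are hypothetical and only serve this illustration.

```python
# Phonemic vowels in the IPA notation used here; everything else counts
# as a consonant or semiconsonant for the purpose of finding clusters.
PHONEMIC_VOWELS = {"i", "e", "ɛ", "æ"}


def split_at_clusters(phonemes):
    """Split a pre-segmented phoneme list wherever two non-vowels meet."""
    units, current = [], []
    for phoneme in phonemes:
        previous_is_consonant = bool(current) and current[-1] not in PHONEMIC_VOWELS
        if previous_is_consonant and phoneme not in PHONEMIC_VOWELS:
            # A consonant cluster: close the current prosodic unit here.
            units.append(current)
            current = []
        current.append(phoneme)
    if current:
        units.append(current)
    return units


units = split_at_clusters(["mˠ", "æ", "ɰ", "tʲ", "ɛ", "lˠ"])
print(" ".join("".join(unit) for unit in units))  # mˠæɰ tʲɛlˠ
```

In non-enunciated mode the same list would simply be passed on unsplit.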


If the sequence {yi'y} /ji̯j/ does not come at the beginning of a prosodic unit, it becomes full {yiy} /jij/.

Lingering uncertainty: Does this also apply to the term io̧kio̧kwe {yi'yakʷyi'yakʷey}? At this stage, does it become /ji̯jækʷjijækʷɛj/ (creating a CVCCVCVC rhythm bizarre for reduplicated terms in the language), or does it remain /ji̯jækʷji̯jækʷɛj/ (with the more natural CVCCVC rhythm)?


The sequence {'yiy} /jiːj/ vocalizes differently if it comes after a consonant within a prosodic unit than when it comes after a vowel or the beginning of the prosodic unit.

  • Normal /jiːj → jijj/
  • Post-consonantal /Cjiːj → Cijj/

Mid-vowel harmony assimilation across semiconsonants. The following assimilations happen regressively, where G represents any semiconsonant or cluster of semiconsonants in continuous speech:

  • {eGi} /ɛGi/ → {ȩGi} /eGi/
  • {eGȩ} /ɛGe/ → {ȩGȩ} /eGe/
  • {ȩGe} /eGɛ/ → {eGe} /ɛGɛ/
  • {ȩGa} /eGæ/ → {eGa} /ɛGæ/

The MRG (Marshallese Reference Grammar, 2016) doesn't specifically mention the fourth sequence as triggering mid-vowel harmony, but describes it as something that cannot occur.
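The four assimilations can be sketched as a small lookup applied over Bender phonemes. This is a hedged illustration in Python, not the module's code: the names `HARMONY` and `harmonize` are hypothetical, the input is assumed to be a pre-segmented list, and scanning from right to left (so that a vowel changed by harmony can in turn affect the vowel before it) is an assumption about what "regressively" implies for chains.

```python
LOW_MID, HIGH_MID = "e", "ȩ"  # Bender {e} and {ȩ}
VOWELS = {"a", LOW_MID, HIGH_MID, "i"}
SEMICONSONANTS = {"y", "h", "w"}

# (vowel, next vowel across G) -> assimilated vowel
HARMONY = {
    (LOW_MID, "i"): HIGH_MID,
    (LOW_MID, HIGH_MID): HIGH_MID,
    (HIGH_MID, LOW_MID): LOW_MID,
    (HIGH_MID, "a"): LOW_MID,
}


def harmonize(phonemes):
    """Apply mid-vowel harmony across semiconsonants, right to left."""
    result = list(phonemes)
    following = None  # nearest vowel to the right, reachable across G only
    for index in range(len(result) - 1, -1, -1):
        phoneme = result[index]
        if phoneme in VOWELS:
            result[index] = HARMONY.get((phoneme, following), phoneme)
            following = result[index]
        elif phoneme not in SEMICONSONANTS:
            following = None  # a full consonant blocks the assimilation
    return result


# Ānewetak {yanȩyweytak}: {ȩ} lowers to {e} before G + {e}.
word = ["y", "a", "n", HIGH_MID, "y", "w", LOW_MID, "y", "t", "a", "k"]
print("".join(harmonize(word)))  # yaneyweytak
```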


Determine syllable stress within a prosodic unit. Stress occurs on the CV of the final CVC sequence of a word, and on the first CV of every CVC, CVCV and CVCCV sequence before that:

  • /ˈCVC/
  • /CVˈCVC/
  • /ˈCVCˈCVC/
  • /ˈCVCVˈCVC/
  • /ˈCVCCVˈCVC/
  • /CVˈCVCCVˈCVC/

...and so forth. If there is more than one stressed syllable in the prosodic unit, the penultimate stressed syllable gets primary stress, and every other stressed syllable gets secondary stress:

  • /ˈCVCˌCVC/
  • /ˈCVCVˌCVC/
  • /ˈCVCCVˌCVC/
  • /ˌCVCˈCVCˌCVC/

In truth, either of the last two stressed syllables may receive primary stress where convenient, but not the antepenultimate stressed syllable or any stressed syllables before that.
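The stress rules above can be tried out on plain CV skeletons. The following Python sketch is one possible reading, not the module's implementation: it stresses the final CVC, then works leftward matching CVCCV, CVCV or CVC units at the right edge of what remains, and leaves any leftover (such as a lone initial CV) unstressed. The function names `stress_positions` and `mark_stress` are hypothetical, and the default of putting primary stress on the penultimate stressed syllable ignores the optional shift to the last one.

```python
def stress_positions(skeleton):
    """Return the start indices of stressed syllables in a CV skeleton."""
    if not skeleton.endswith("CVC"):
        raise ValueError("a prosodic unit is expected to end in CVC")
    end = len(skeleton) - 3  # start of the final CVC, always stressed
    stressed = [end]
    while end > 0:
        head = skeleton[:end]
        for unit in ("CVCCV", "CVCV", "CVC"):
            if head.endswith(unit):
                end -= len(unit)
                stressed.append(end)
                break
        else:
            break  # leftover material such as a lone CV stays unstressed
    return sorted(stressed)


def mark_stress(skeleton):
    """Insert primary and secondary stress marks into a CV skeleton."""
    stressed = stress_positions(skeleton)
    primary = stressed[-2] if len(stressed) > 1 else stressed[-1]
    marked = []
    for index, symbol in enumerate(skeleton):
        if index in stressed:
            marked.append("ˈ" if index == primary else "ˌ")
        marked.append(symbol)
    return "".join(marked)


print(mark_stress("CVCVCVC"))    # ˈCVCVˌCVC
print(mark_stress("CVCCVCCVC"))  # ˌCVCˈCVCˌCVC
```

The three unit shapes cannot match the same right edge at once (they differ in their last or third-from-last symbol), so the order in which they are tried does not change the result.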


Now that the stress pattern has been determined, process consonant clusters for assimilations and epenthesis. This entire process is rendered moot in enunciated mode. But where it applies, I won't rehash all the rules here, since they're already mostly or fully described in the article.

However, the MRG did provide clearer rules on how to determine the F1 of epenthetic vowels.

  • Between two full consonants, the epenthetic vowel's F1 is the max() of the F1 of the vowels on the right and left and the vowel {e} /ɛ/. The open vowel {a} /æ/ does not occur epenthetically in this circumstance.
  • Between one full consonant and one semiconsonant, the epenthetic vowel's F1 is the same as whichever vowel occurs on the other side of the semiconsonant, no matter which side of the cluster the semiconsonant is located on. The open vowel {a} /æ/ is allowed here.
  • Between two semiconsonants, the epenthetic vowel's F1 is the same as the closest vowel to the left.
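The three cases above can be sketched as a single function. This is an illustrative Python sketch under stated assumptions: "max() of the F1" is read here as "the closest (highest) of the vowels", since that is the only reading under which {a} can never result between two full consonants; the height ranks in `HEIGHT` and the name `epenthetic_vowel` are hypothetical.

```python
# Assumed height ranks for the Bender vowels, from open to close.
HEIGHT = {"a": 0, "e": 1, "ȩ": 2, "i": 3}
BY_HEIGHT = {rank: vowel for vowel, rank in HEIGHT.items()}
SEMICONSONANTS = {"y", "h", "w"}


def epenthetic_vowel(left_vowel, first, second, right_vowel):
    """Vowel inserted into the cluster first+second between two vowels."""
    first_is_glide = first in SEMICONSONANTS
    second_is_glide = second in SEMICONSONANTS
    if first_is_glide:
        # Two semiconsonants: copy the closest vowel to the left.
        # Semiconsonant + full consonant: the vowel on the other side
        # of the semiconsonant is also the one on the left.
        return left_vowel
    if second_is_glide:
        # Full consonant + semiconsonant: copy the vowel on the right.
        return right_vowel
    # Two full consonants: never more open than {e}.
    rank = max(HEIGHT[left_vowel], HEIGHT[right_vowel], HEIGHT["e"])
    return BY_HEIGHT[rank]


print(epenthetic_vowel("a", "l", "w", "i"))   # i  (baļwūn)
print(epenthetic_vowel("e", "ļ", "y", "a"))   # a  (bōļeak)
print(epenthetic_vowel("e", "b", "kʷ", "a"))  # e  (jebkwanwūjo̧)
```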

From this point on, the algorithm no longer makes a distinction between full vowels and epenthetic vowels:

  • /ˈCVC(V)ˌCVC → ˈCVCVˌCVC/
  • /ˈCVC(V)CVˌCVC → ˈCVCVCVˌCVC/

Native speakers intuitively know where the true consonant clusters lie, so even though from this point epenthetic vowels phonetically behave the same way as full vowels and their F1 and F2 are predictable, native speakers do not normally perceive them to exist at all. This is why there's a separate enunciated mode to make this clear:

  • /ˈCVC(V)ˌCVC → ˈCVC ˈCVC/
  • /ˈCVC(V)CVˌCVC → ˈCVC CVˈCVC/

Because Marshallese is a vertical vowel language, vowel F2 is not phonemic. Between two consonants with different secondary articulations, single vowels tend to sound like diphthongs or triphthongs in slow speech. But in faster speech, especially between two full consonants, the vowel's F2 midpoint tends to stand out as the vowel's only sound. And since we've established that transcriptions like [mʲe͡ɤ͡oˈu͡ɯrˠ] from /mʲeˈwirˠ/ appear convoluted and hard to read, my newer algorithm concepts stick to the vowel midpoints, in this case [mʲɤˈwu͑rˠ]. So basically:

Cʲ‿Cʲ Cʲ‿Cˠ/Cˠ‿Cʲ Cʲ‿Cʷ/Cˠ‿Cˠ/Cʷ‿Cʲ Cˠ‿Cʷ/Cʷ‿Cˠ Cʷ‿Cʷ
i ɨ ɯ u
e ɘ ɤ o
ɛ ɜ ʌ ɔ͑ ɔ
æ a ɑ ɒ͑ ɒ

Now that the vowels have an allophonic F2, the semiconsonants need an allophonic F1. For now, temporarily, semiconsonants at the beginning or end of a prosodic unit assume the F1 of their neighboring vowels:

  j‿Cʲ j‿Cˠ j‿Cʷ ɰ‿Cʲ ɰ‿Cˠ ɰ‿Cʷ w‿Cʲ w‿Cˠ w‿Cʷ
GiC jiCʲ jɨCˠ jɯCʷ ɰɨCʲ ɰɯCˠ ɰu͑Cʷ wɯCʲ wu͑Cˠ wuCʷ
GeC e̯eCʲ e̯ɘCˠ e̯ɤCʷ ɤ̯ɘCʲ ɤ̯ɤCˠ ɤ̯o͑Cʷ o̯ɤCʲ o̯o͑Cˠ o̯oCʷ
GɛC ɛ̯ɛCʲ ɛ̯ɜCˠ ɛ̯ʌCʷ ʌ̯ɜCʲ ʌ̯ʌCˠ ʌ̯ɔ͑Cʷ ɔ̯ʌCʲ ɔ̯ɔ͑Cˠ ɔ̯ɔCʷ
GæC æ̯æCʲ æ̯aCˠ æ̯ɑCʷ ɑ̯aCʲ ɑ̯ɑCˠ ɑ̯ɒ͑Cʷ ɒ̯ɑCʲ ɒ̯ɒ͑Cˠ ɒ̯ɒCʷ

But for semiconsonants between vowels:

  • The F1 of {y} /j/ or {w} /w/ is the max() of the F1 of the two neighboring vowels. Any remaining special sequence /ji̯j/ simply becomes [j].
  • The F1 of {h} /ɰ/ (and, again, this is a logic leap that may be slightly OR) is the min() of the F1 of the two neighboring vowels. This is in line with Bender (1968) mentioning his native-Arabic-speaking colleague saying that {h} was an ayin (a pharyngeal consonant), and evidenced by how intervocalic {h} is treated in the spellings of the words iakwāāl {yihakʷayal} 'quarrel' and otobai {wetewbahiy} 'motorbike'.
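The two intervocalic rules can be sketched as a lookup. This is a hedged Python illustration rather than the module's code: max() is read as "the closer of the two vowels" and min() as "the more open of the two", which is how examples such as ewan behave, and the height ranks, the table `GLIDES` and the name `glide_symbol` are assumptions made for this example.

```python
NS = "\u032f"  # combining inverted breve below (non-syllabic mark)

# Assumed height ranks for the Bender vowels, from open to close.
HEIGHT = {"a": 0, "e": 1, "ȩ": 2, "i": 3}

# Surface symbols per semiconsonant, indexed by height rank.
GLIDES = {
    "y": ["æ" + NS, "ɛ" + NS, "e" + NS, "j"],
    "h": ["ɑ" + NS, "ʌ" + NS, "ɤ" + NS, "ɰ"],
    "w": ["ɒ" + NS, "ɔ" + NS, "o" + NS, "w"],
}


def glide_symbol(glide, left_vowel, right_vowel):
    """Surface symbol of a semiconsonant between two Bender vowels."""
    heights = (HEIGHT[left_vowel], HEIGHT[right_vowel])
    # {h} takes the more open neighbor, {y} and {w} the closer one.
    rank = min(heights) if glide == "h" else max(heights)
    return GLIDES[glide][rank]


print(glide_symbol("w", "e", "a"))  # mid rounded glide, as in ewan
print(glide_symbol("h", "i", "a"))  # open back glide, as in iakwāāl
print(glide_symbol("y", "a", "i"))  # j
```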

At this point, the pronunciation is getting a lot closer to how the language is acoustically heard.


Begin OR. These experimental steps are an attempt to simplify phonetic IPA to make it easier to read, hopefully without introducing ambiguities. These steps are not finalized, and I would say they probably do not apply in enunciated mode anyway.


If a prosodic unit ends with a vowel and semiconsonant, change the vowel's F2 to match the semiconsonant's, and allow no further changes to those particular vowels:

  • io̧kwe {yi'yakʷey} /ji̯jækʷɛj/ [jɑˈkʷɤe̯ → jɑˈkʷee̯] (final form [jɑˈɡʷɛ]) 'love'

Detect and simplify long monophthongs on syllables with primary or secondary stress, and allow no further changes to these particular vowels:

  • jojo {jȩwjȩw} /tʲewtʲew/ [ˈtʲɤo̯ɤˌtʲɤo̯ → ˈtʲoo̯oˌtʲoo̯] (final form [ˈtʲoːˌdʲo]) 'flying fish'
  • ōeōe {hȩyhȩy} /ɰejɰej/ [ˈɤ̯ɘe̯ɘˌɤ̯ɘe̯ → ˈɤ̯ee̯eˌɤ̯ee̯] (final form [ˈʁeːˌʁe]) 'demon sound onomatopoeia'

Unexpected (but possibly acceptable) side effect:

  • awaan {hawahan} /ɰæwæɰænʲ/ [ˈɑ̯ɒ͑ɒ̯ɒ͑ˌɑ̯anʲ → ˈɑ̯ɒɒ̯ɒˌɑ̯anʲ] (final form [ˈʕɒːˌɑnʲ]) 'hour of'

Similar to two steps ago, for all single vowels preceding semiconsonants of the same F1, change the vowel's F2 to match the semiconsonant's, but allow further changes:

  • Nuwio̧o̧k {niwiyawak} /nʲiwijæwæk/ [nʲɯˈwɯjɑˌɒ̯ɒ͑kˠ → nʲuˈwijɒˌɒ̯ɒ͑kˠ] (final form [nʲuˈwiɒˌwɒ͑kˠ]) 'New York'

If any single vowel besides {a} /æ/ occurs after {y} /j/ of the same F1 and before a velarized full consonant, change the vowel's F2 to match the {y}:

  • em̧m̧an {yem̧m̧an} /jɛmˠmˠænʲ/ [ˈɛ̯ɜmˠˌmˠanʲ → ˈɛ̯ɛmˠˌmˠanʲ] (final form [ˈɛmˌmˠanʲ]) 'it is good'

If any single vowel besides {a} /æ/ occurs after {y} /j/ of the same F1 and before any rounded consonant that begins the next stressed syllable, change the vowel's F2 to match the {y}:

  • ewan {yewan} /jɛwænʲ/ [ɛ̯ɔˈɔ̯ɑnʲ → ɛ̯ɛˈɔ̯ɑnʲ] (final form [ɛˈwɑnʲ]) 'a time for being engaged in special activity'
  • iiāekwōj {yiyayȩkʷȩj} /jiːjæjekʷetʲ/ [ˌjijiˈjæe̯ɤˌkʷɤtʲ → ˌjijiˈjæe̯eˌkʷɤtʲ] (final form [ˌiːˈjæeˌɡʷɤtʲ]) 'to race'

If a single vowel comes after {h} /ɰ/ of the same F1, change its F2 to match:

  • aelōn̄ {hayȩlȩg} /ɰæjelʲeŋ/ [ˈɑ̯ae̯eˌlʲɘŋˠ → ˈɑ̯ɑe̯eˌlʲɘŋˠ] (final form [ˈɑeˌlʲɘŋˠ]) 'atoll'

If any single vowel besides {a} /æ/ occurs after {w} /w/ of the same F1 and before a stressed syllable (regardless of the consonant), change the vowel's F2 to match the {w}:

  • jedo̧ujij {jedawijij} /tʲɛrʲæwitʲitʲ/ [tʲɛˈrʲɑwɯˌtʲitʲ → tʲɛˈrʲɑwuˌtʲitʲ] (final form [tʲɛˈrʲɑuˌdʲitʲ]) 'trousers'

End OR. The remaining steps apply whether or not these OR steps did.


Unsurface semiconsonants whose F1 and F2 are an exact match for at least one of their neighboring vowels:

  • āne {yanȩy} /jænʲej/ [æ̯æˌnʲee̯ → æˌnʲe] 'islet'
  • Būļāide {biļayidey} /pˠilˠæjirʲɛj/ [pˠɯˈlˠajiˌrʲɛɛ̯ → pˠɯˈlˠaiˌrʲɛ] (also its final form) 'Friday'

But not if they cross a stressed syllable boundary:

  • iiaeae {'yiyahyahyey} /jiːjæɰjæɰjɛj/ [ˌjijiˌjɑɑ̯ɑˈɛ̯ɑɑ̯ɑˌɛ̯ɛɛ̯ → ˌiiˌjɑɑˈɛ̯ɑɑˌɛ] (final form [ˌiːˌjɑːˈɛ̯ɑːˌɛ]), not [ˌiiˌɑɑˈɛ̯ɑɑˌɛ] 'rainbow-colored'

And not if the vowel is an epenthetic vowel before the semiconsonant second member of a consonant cluster:

  • jebkwanwūjo̧ {jebkʷanwijaw} /tʲɛpˠkʷænʲwitʲæw/ [ˌtʲɜpˠɔ͑ˈkʷɑnʲuwɯˌtʲɒɒ̯ → ˌtʲɜpˠɔ͑ˈkʷɑnʲuwɯˌtʲɒ] (final form [ˌtʲɜbˠɔ͑ˈɡʷɑnʲuwɯˌdʲɒ]), not [ˌtʲɜpˠɔ͑ˈkʷɑnʲuɯˌtʲɒ] 'coconut oil used for frying'

Even though we're no longer treating epenthetic vowels differently from normal vowels, these can still be detected in context because the original CVCGV sequence created a syncopated /ˈCVCVGV/ sequence that cannot otherwise occur.
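A minimal sketch of the unsurfacing step, in Python and under stated assumptions: the input is a pre-segmented list of phonetic tokens in which stress marks are tokens of their own, so a vowel on the far side of a stress mark is never an immediate neighbor and the stressed-boundary exception falls out of the adjacency check. The epenthetic-vowel exception is not handled here. The table `GLIDE_TO_VOWEL` and the name `unsurface` are hypothetical.

```python
NS = "\u032f"  # combining inverted breve below (non-syllabic mark)

# Each surfaced semiconsonant and the vowel it is the approximant of.
GLIDE_TO_VOWEL = {
    "j": "i", "e" + NS: "e", "ɛ" + NS: "ɛ", "æ" + NS: "æ",
    "ɰ": "ɯ", "ɤ" + NS: "ɤ", "ʌ" + NS: "ʌ", "ɑ" + NS: "ɑ",
    "w": "u", "o" + NS: "o", "ɔ" + NS: "ɔ", "ɒ" + NS: "ɒ",
}


def unsurface(tokens):
    """Drop semiconsonants that exactly match an adjacent vowel."""
    kept = []
    for index, token in enumerate(tokens):
        left = tokens[index - 1] if index > 0 else None
        right = tokens[index + 1] if index + 1 < len(tokens) else None
        vowel = GLIDE_TO_VOWEL.get(token)
        if vowel is not None and vowel in (left, right):
            continue  # same F1 and F2 as a direct neighbor: elide
        kept.append(token)
    return kept


islet = ["æ" + NS, "æ", "ˌ", "nʲ", "e", "e" + NS]
print("".join(unsurface(islet)))  # æˌnʲe
```

In the iiaeae example, the [j] that follows a stress mark has [i] only on the far side of that mark, so this check keeps it, in line with the exception described above.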


Adjust the F1 of remaining surfaced semiconsonants:

  • [æ̯ → ɛ̯]
  • [ɒ̯ → w]
  • [ɔ̯ → w]
  • [o̯ → w]

Also possibly use these symbols, though it is not exactly conventional:

  • [ɑ̯ → ʕ]
  • [ʌ̯ → ʁ]
  • [ɤ̯ → ʁ]

If not enunciating, indicate long stressed monophthongs as geminated vowels:

  • jijāj {jiyjaj} /tʲijtʲætʲ/ [ˈtiiˌtʲætʲ → ˈtiːˌtʲætʲ] (final form [ˈtiːˌdʲætʲ]) 'scissors'

Simplify the secondary articulation of non-epenthetic consonant clusters, even across syllable stress boundaries:

  • babbūb {babbib} /pˠæpˠpˠipˠ/ [ˈpˠɑpˠˌpˠɯpˠ ˈpˠɑpˌpˠɯpˠ] (also its final form) 'butterfly'

Voice single obstruent consonants between two vowels:

  • [VpV VbV]
  • [VtV VdV]
  • [VkV VɡV]
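
The intervocalic voicing rule could be sketched in Python roughly as follows. This is only an illustration, not the module's code: `INTERVOCALIC` and `voice_intervocalic` are invented names, and letting a stress mark stand between the vowel and the obstruent is an assumption read off the final forms quoted earlier.

```python
import re

VOWELS = "iɨɯueɘɤoɛɜʌɔæaɑɒ"
VOICED = {"p": "b", "t": "d", "k": "\u0261"}  # \u0261 is IPA ɡ
# vowel (+ diacritics or length mark) (+ stress mark), then p/t/k with an
# optional secondary articulation, then a lookahead for another vowel.
INTERVOCALIC = re.compile(
    r"([%s][\u0300-\u036F\u02D0]*[ˈˌ]?)([ptk])([ʲˠʷ]?)(?=[%s])" % (VOWELS, VOWELS)
)


def voice_intervocalic(ipa):
    """Voice a single p/t/k that stands between two vowels."""
    return INTERVOCALIC.sub(
        lambda m: m.group(1) + VOICED[m.group(2)] + m.group(3), ipa
    )
```

For instance, [tʲɛˈrʲɑuˌtʲitʲ] becomes [tʲɛˈrʲɑuˌdʲitʲ], while the cluster in [ˈpˠɑpˌpˠɯpˠ] is left untouched because neither obstruent has a vowel on both sides.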

If not enunciating, voice an obstruent consonant when it is the second member of a consonant cluster whose first member is a nasal consonant:

  • [mp mb]
  • [nt nd]
  • [ŋk ŋɡ]
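
A small Python sketch of the post-nasal rule, under the assumption that only the three pairs listed are affected. The names `VOICED_AFTER`, `NASAL_CLUSTER` and `voice_after_nasal` are hypothetical. Allowing a secondary articulation mark or a stress mark between the two consonants is likewise an assumption.

```python
import re

# The three nasal + obstruent pairs listed above and their voiced outcomes.
VOICED_AFTER = {"m": ("p", "b"), "n": ("t", "d"), "ŋ": ("k", "\u0261")}
# nasal (+ optional ʲ ˠ ʷ) (+ optional stress mark), then p/t/k.
NASAL_CLUSTER = re.compile(r"([mnŋ])([ʲˠʷ]?[ˈˌ]?)([ptk])")


def voice_after_nasal(ipa, enunciate=False):
    """Voice p/t/k after its listed nasal unless enunciating."""
    if enunciate:
        return ipa

    def fix(match):
        nasal, between, stop = match.groups()
        expected, voiced = VOICED_AFTER[nasal]
        if stop != expected:
            return match.group(0)  # pair not in the list: leave as is
        return nasal + between + voiced

    return NASAL_CLUSTER.sub(fix, ipa)
```

The `enunciate` flag mirrors the "if not enunciating" condition: with `enunciate=True` the input is returned unchanged.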

So that's basically the meat of the algorithm as it currently stands. See wiktionary:Module:mh-pronunc for the algorithm's current output, which appears as the second line of each term's phonetic pronunciation. - Gilgamesh (talk) 18:59, 25 May 2020 (UTC)


@Austronesier, Erutuon, and Kwamikagami: Do any of you have any thoughts about, issues concerning or suggested modifications for this? I would like this to be as consensus-informed and well-polished as possible before I consider replacing the current phonetic algorithm. - Gilgamesh (talk) 14:30, 28 May 2020 (UTC)

I can't follow everything, but I would not agree with any reification of artefacts of Bender's deep phonemic analysis: 1) Long vowels should remain notated as long vowels 2) the ghost phoneme /ɰ/ should be represented as phonetic zero. –Austronesier (talk) 14:47, 28 May 2020 (UTC)
I'd be leery of an Arabic speaker IDing [ʕ]. He'd've heard the closest thing in his language too. I've been steered wrong by people IDing a sound as equivalent to a pharyngeal in their language. — kwami (talk) 21:56, 28 May 2020 (UTC)
You raise good points. This is why I laid this out.
I think {h} can be represented as a phonetic zero, and just have it not surface anywhere. I would probably make one modification to the vowel symbols to compensate and minimize visual confusion, with changes in bold:
Cʲ‿Cʲ | Cʲ‿Cˠ, Cˠ‿Cʲ | Cʲ‿Cʷ, Cˠ‿Cˠ, Cʷ‿Cʲ | Cˠ‿Cʷ, Cʷ‿Cˠ | Cʷ‿Cʷ
i | ɨ | ɯ | ɯ͗ | u
e | ɘ | ɤ | ɤ͗ | o
ɛ | ɜ | ʌ | ʌ͗ | ɔ
æ | a | ɑ | ɑ͗ | ɒ
This also fixes the second problem, as now, no character can be portrayed as [ʕ] or [ʁ].
But this also introduces a dilemma to ōeōe {hȩyhȩy} /ɰejɰej/. If I continue to treat the first half as a long vowel, then what was [ˈɤ̯eːˌɤ̯e] or [ˈʁeːˌʁe] now becomes [ˈeːˌe]. Treating {h} as a phonetic zero consonant does not mean it has no phonetic influence at all, but it does mean that either we make an exception for these conditions or we don't notate long vowels here. Perhaps [ˈɘe̯ɘˌɘe̯] or [ˈɘeˌɘe̯]? The equivalent for awaan {hawahan} /ɰæwæɰænʲ/ would be [ˈɑ͗wɑ͗ˌanʲ] or [ˈɑ͗ɒˌanʲ], respectively, so that's worth taking into consideration.
This deserves more thought than I can give it right now, but I wanted to at least respond promptly. - Gilgamesh (talk) 10:22, 29 May 2020 (UTC)
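
The five-column vowel table in the comment above can be read as a lookup from a vowel phoneme and its two flanking consonants to a surface vowel. Here is a minimal Python sketch of that reading. `RANK`, `SURFACE` and `surface_vowel` are invented names. Adding the two ranks is only an indexing trick that happens to reproduce the table's grouping of the nine contexts into five columns, and it is not something stated in the discussion.

```python
# Rank of each secondary articulation: palatalized, velarized, rounded.
RANK = {"ʲ": 0, "ˠ": 1, "ʷ": 2}

# Rows are the four vowel phonemes; columns follow the table, left to right.
SURFACE = {
    "i": ["i", "ɨ", "ɯ", "ɯ͗", "u"],
    "e": ["e", "ɘ", "ɤ", "ɤ͗", "o"],
    "ɛ": ["ɛ", "ɜ", "ʌ", "ʌ͗", "ɔ"],
    "æ": ["æ", "a", "ɑ", "ɑ͗", "ɒ"],
}


def surface_vowel(phoneme, left, right):
    """Pick the table cell for a vowel between two consonant types."""
    return SURFACE[phoneme][RANK[left] + RANK[right]]
```

For example, `surface_vowel("i", "ˠ", "ˠ")` picks [ɯ], as in babbūb, and `surface_vowel("e", "ʲ", "ˠ")` picks [ɘ], as in the last syllable of aelōn̄.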
As for the Arabic-speaker anecdote: in many Arabic dialects, ayin is pronounced very weakly, so I suspect that any smooth onset without a glottal stop will be perceived by their speakers as ayin (especially for dialects that have the change [q] > [ʔ]). Rene van den Berg describes a similar thing for Southern Muna here[1] (p.142), which I can confirm from my own experience with that dialect. Syllable-initial zero does sound like "something".
I am not sure about ōeōe. I suspect it's actually /hẹhẹyhẹhẹy/. I'm still working on a less abstract phonemic analysis of Marshallese long vowels (and maybe submit it somewhere if I'm sure about it), which you could use as an internal tool for simplifying the code. –Austronesier (talk) 11:40, 29 May 2020 (UTC)
Thank you, I appreciate your efforts. I'll keep working on it at my end as well. - Gilgamesh (talk) 16:01, 29 May 2020 (UTC)