Onset clusters

Thai has a small inventory of true consonant clusters — sequences of two consonants at the start of a syllable, both pronounced. These contrast with visually similar sequences where only one consonant contributes to the onset.


True clusters

In native Thai and older Sanskritic borrowings, the following cluster patterns occur as genuine two-consonant onsets:

Pattern      IPA                                  Examples
stop + /r/   /pr/ /pʰr/ /tr/ /tʰr/ /kr/ /kʰr/     ประ, ตรา, กรุง
stop + /l/   /pl/ /pʰl/ /kl/ /kʰl/                ปลา, เพลง, กลาง
stop + /w/   /kw/ /kʰw/                           กว้าง, ควาย

These are encoded as Cluster objects in the onset position of the Syllable:

from thaiphon import analyze

result = analyze("ปลา")
syl = result.best.syllables[0]
print(type(syl.onset).__name__)   # Cluster
print(syl.onset.first.symbol)     # 'p'
print(syl.onset.second.symbol)    # 'l'
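
The pattern table can also be read as a small lookup. The following is a minimal standalone sketch, not thaiphon's implementation: the names STOPS, ALLOWED, and cluster_ipa are hypothetical, the letter map covers only the stops listed in the table, and letter-specific spelling exceptions are ignored.

```python
# Illustrative sketch only; these names are not part of thaiphon.
# Initial stops from the table, mapped to their IPA onset values.
STOPS = {
    "ป": "p", "ผ": "pʰ", "พ": "pʰ",
    "ต": "t", "ท": "tʰ",
    "ก": "k", "ข": "kʰ", "ค": "kʰ",
}

# Second members and the stops each one combines with, per the table.
ALLOWED = {
    "ร": ("r", {"p", "pʰ", "t", "tʰ", "k", "kʰ"}),
    "ล": ("l", {"p", "pʰ", "k", "kʰ"}),
    "ว": ("w", {"k", "kʰ"}),
}


def cluster_ipa(first, second):
    """Return the cluster in IPA, or None if the pair is not in the table."""
    stop = STOPS.get(first)
    if stop is None or second not in ALLOWED:
        return None
    sonorant, stops = ALLOWED[second]
    return stop + sonorant if stop in stops else None


print(cluster_ipa("ป", "ล"))  # pl
print(cluster_ipa("ค", "ว"))  # kʰw
print(cluster_ipa("ต", "ล"))  # None
```

A pair such as ต + ล returns None because /tl/ is not among the patterns in the table.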

Cluster simplification

In everyday colloquial speech, some cluster onsets are simplified by dropping the second consonant. This is lexicon-driven — not all clusters simplify, and whether one does depends on the word and the register.

The everyday profile applies cluster simplification where it is standard for that word. The careful_educated profile retains more clusters.

thaiphon does not yet implement comprehensive cluster simplification across all vocabulary; this is an area of active development.
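
The lexicon-driven behaviour described above could be sketched roughly as follows. This is a standalone illustration under stated assumptions, not thaiphon's API: the function onset_for and the set SIMPLIFY_IN_EVERYDAY are hypothetical, and the two lexicon entries are illustrative examples rather than data taken from the library. Only the profile names come from the text.

```python
# Illustrative sketch only; not thaiphon's API.
# Hypothetical closed lexicon of words whose cluster is commonly
# reduced in casual speech. The entries are examples for this sketch.
SIMPLIFY_IN_EVERYDAY = {"ปลา", "ครับ"}


def onset_for(word, onset, profile):
    """Return the onset phonemes for `word` under a reading profile.

    `onset` is a tuple such as ("p", "l"); `profile` is "everyday" or
    "careful_educated".
    """
    simplifies = word in SIMPLIFY_IN_EVERYDAY
    if profile == "everyday" and len(onset) == 2 and simplifies:
        return onset[:1]  # drop the second consonant
    return onset  # keep the cluster as written


print(onset_for("ปลา", ("p", "l"), "everyday"))          # ('p',)
print(onset_for("ปลา", ("p", "l"), "careful_educated"))  # ('p', 'l')
print(onset_for("กลาง", ("k", "l"), "everyday"))         # ('k', 'l')
```

Keying the decision on the word rather than on the cluster shape reflects the point that not every cluster simplifies.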


Pseudo-clusters: aksornam

A sequence that looks like two onset consonants may actually be an aksornam (leader) construction: a lone consonant without a written vowel, followed by a second syllable whose onset is an LC sonorant. In this case the first character does not form a cluster with the second consonant; instead, it sets the tone class of the following syllable.

Example: สมาน

  • ส (HC, no vowel): the aksornam leader, not a cluster component.
  • มาน: the syllable. Its onset ม is an LC sonorant, but under the HC leader's influence the syllable takes HC as its effective class, which gives a rising tone.

The two patterns are distinguishable because true clusters share a vowel nucleus with the onset pair, whereas an aksornam has no vowel of its own — the leader character is bare.
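
The leader's effect on tone class can be illustrated with a short standalone sketch. It is not thaiphon's implementation: effective_class and the three letter sets are names chosen for this example, and the tone table covers only live syllables with no tone mark.

```python
# Illustrative sketch only; not thaiphon's implementation.
HIGH = set("ขฃฉฐถผฝศษสห")        # high-class (HC) consonants
MID = set("กจฎฏดตบปอ")            # mid-class (MC) consonants
LOW_SONORANT = set("งญณนมยรลวฬ")  # low-class (LC) sonorants


def effective_class(leader, onset):
    """Tone class of a syllable whose onset follows a bare leader.

    Returns None when the onset is not an LC sonorant, since the
    leader pattern described above does not apply in that case.
    """
    if onset not in LOW_SONORANT:
        return None
    if leader in HIGH:
        return "HC"
    if leader in MID:
        return "MC"
    return "LC"  # no leader effect: the sonorant keeps its own class


# Tone of a live syllable with no tone mark, by effective class.
UNMARKED_LIVE_TONE = {"HC": "rising", "MC": "mid", "LC": "mid"}

cls = effective_class("ส", "ม")
print(cls, UNMARKED_LIVE_TONE[cls])  # HC rising
```

For สมาน, the leader ส is HC and the onset ม is an LC sonorant, so the second syllable is treated as HC and receives a rising tone.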


Inserted-vowel clusters: the กว/ขว/คว pattern

Some sequences of consonant + ว + coda look like a cluster onset but are actually pronounced with an inserted /u/ vowel. The table contrasts one such word with a word that keeps the ordinary cluster reading:

Word               Literal cluster reading   Actual pronunciation
ขวาน (axe)         /kʰwan/                   /kʰuːan/, via the /uːə/ nucleus
กวาด (to sweep)    /kwat/                    standard cluster (same as the literal reading)

These cases are handled by a closed lexicon of known insert-U words. The runner detects them before syllabification and respells the input to make the inserted vowel explicit, so the rest of the derivation pipeline processes it normally.
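
The respelling step can be pictured as a dictionary lookup applied before syllabification. The sketch below is an assumption-laden illustration, not the library's code: INSERT_U_RESPELLINGS and respell are hypothetical names, and the brace notation used for the explicit vowel is invented for this example because the actual respelling format is not documented here.

```python
# Illustrative sketch only; not thaiphon's implementation.
# Hypothetical closed lexicon: surface spelling -> respelling with the
# vowel made explicit. The entry reuses the example from the table;
# the brace notation is invented for this sketch.
INSERT_U_RESPELLINGS = {"ขวาน": "ข{uːə}น"}


def respell(words):
    """Replace known insert-U words; pass all other words through."""
    return [INSERT_U_RESPELLINGS.get(word, word) for word in words]


print(respell(["ขวาน", "กวาด"]))
```

Words outside the lexicon, such as กวาด, are left unchanged and keep the standard cluster reading.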


Syllabification of clusters

When the candidate generator encounters a possible cluster, it produces candidates both with and without the cluster interpretation. The ranker then selects the best candidate based on structural scoring — genuine clusters score higher when the phonotactic pattern is productive (stops followed by /r l w/).

For unknown words, the engine generally prefers to interpret stop+sonorant sequences as clusters when they match attested cluster patterns, and as aksornam otherwise.
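
The generate-then-rank idea can be sketched in a few lines. This is a toy model under stated assumptions, not the engine's actual ranker: the Candidate class, the ATTESTED set (which is not exhaustive), and the numeric scores are all hypothetical placeholders.

```python
# Illustrative sketch only; names and scores are hypothetical.
from dataclasses import dataclass

# Letter pairs treated here as attested stop + /r l w/ clusters.
# The set is not exhaustive.
ATTESTED = {
    "ปร", "พร", "ตร", "กร", "คร",
    "ปล", "พล", "กล", "คล",
    "กว", "คว", "ขว",
}


@dataclass
class Candidate:
    reading: str  # "cluster" or "aksornam"
    score: float


def candidates(first, second):
    """Produce both interpretations of a two-consonant sequence."""
    attested = (first + second) in ATTESTED
    return [
        Candidate("cluster", 1.0 if attested else 0.0),
        Candidate("aksornam", 0.5),
    ]


def best(first, second):
    """Pick the highest-scoring interpretation."""
    return max(candidates(first, second), key=lambda c: c.score)


print(best("ป", "ล").reading)  # cluster
print(best("ส", "ม").reading)  # aksornam
```

Both readings are always generated; only the scoring decides between them, which mirrors the preference for clusters when the pattern is attested and for aksornam otherwise.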