Special cases¶
Several orthographic constructions in Thai require special handling that goes beyond the standard syllabification and derivation rules.
Leading ห — the silent tone-shifter¶
When the High-class letter ห appears immediately before a Low-class sonorant (ง ญ น ม ย ร ล ว) without a vowel sign between them, it is not pronounced. Instead, it shifts the following syllable's effective class from Low to High.
This is a purely orthographic tone-shifting device. The ห contributes no phoneme to the output — only a change of class.
Examples:
| Written | Without leading ห | With leading ห | Tone shift |
|---|---|---|---|
| หนา | นา = LC, live → MID tone | หนา = promoted to HC, live → RISING | MID → RISING |
| หมา | มา = LC → MID | หมา = HC → RISING | MID → RISING |
| หนึ่ง | นึ่ง = LC, ◌่ = FALLING | หนึ่ง = HC, ◌่ = LOW | FALLING → LOW |
In the API:
```python
from thaiphon import transcribe

transcribe("หนา", scheme="ipa")  # /naː˩˩˦/ (rising tone)
transcribe("นา", scheme="ipa")   # /naː˧/ (mid tone)
```
The effective_class on the Syllable object records HIGH for syllables that received the leading-ห promotion. The onset field still records the actual pronounced consonant (not ห).
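The promotion rule can be illustrated with a small standalone sketch. It does not use thaiphon's internals: `split_leading_ho`, `live_tone`, and the `LIVE_TONE` table are hypothetical names, and the table covers only the four live-syllable cases from the examples above. The sketch also assumes the word is a single syllable that starts with a Low-class sonorant (optionally preceded by ห) and ignores cases where the letter after ห acts as a vowel or coda.

```python
# Low-class sonorants: ง ญ น ม ย ร ล ว
LOW_SONORANTS = set("\u0e07\u0e0d\u0e19\u0e21\u0e22\u0e23\u0e25\u0e27")
HO_HIP = "\u0e2b"  # ห
MAI_EK = "\u0e48"  # ◌่

# Tone of a live syllable, keyed by (effective class, tone mark).
# Only the combinations shown in the table above are listed.
LIVE_TONE = {
    ("LOW", ""): "MID",
    ("HIGH", ""): "RISING",
    ("LOW", MAI_EK): "FALLING",
    ("HIGH", MAI_EK): "LOW",
}


def split_leading_ho(word: str) -> tuple[str, str]:
    """Return (effective_class, word_without_silent_ho)."""
    if len(word) >= 2 and word[0] == HO_HIP and word[1] in LOW_SONORANTS:
        # ห is dropped from the output; only the class changes.
        return "HIGH", word[1:]
    # Assumption: any other input already starts with a Low-class sonorant.
    return "LOW", word


def live_tone(word: str) -> str:
    """Look up the tone for a live syllable after leading-ห promotion."""
    effective_class, _ = split_leading_ho(word)
    mark = MAI_EK if MAI_EK in word else ""
    return LIVE_TONE[(effective_class, mark)]
```

With these definitions, `live_tone("หนา")` and `live_tone("นา")` differ only because of the class returned by `split_leading_ho`, which mirrors the MID → RISING shift in the table.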
Sara Am — decomposition of ◌ำ¶
Sara Am (◌ำ, Unicode U+0E33) looks like a single vowel mark but phonemically decomposes into a long /aː/ vowel plus a nasal /m/ coda.
This decomposition happens during input expansion, before syllabification. The word น้ำ (water) contains:
- น — onset /n/, LC sonorant
- ้ — mai tho tone mark
- ำ — decomposes to long /aː/ + /m/ coda
Result: onset /n/ + vowel /aː/ LONG + coda /m/ + LC class + mai tho → HIGH tone. IPA: /naːm˦˥/.
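One way to picture the expansion step is as a plain string rewrite that runs before syllabification. The sketch below is an assumption about how such a step could look, not thaiphon's actual code: `expand_sara_am` is a hypothetical helper, and respelling ◌ำ as ◌า + ม is chosen only because it matches the long /aː/ + /m/ description above.

```python
SARA_AM = "\u0e33"  # ◌ำ
SARA_AA = "\u0e32"  # ◌า, long /aː/
MO_MA = "\u0e21"    # ม, supplies the /m/ coda


def expand_sara_am(text: str) -> str:
    """Rewrite every Sara Am as an explicit long vowel plus ม coda."""
    return text.replace(SARA_AM, SARA_AA + MO_MA)
```

Applied to น้ำ, this yields น้าม: the tone mark stays on the onset consonant, so the later tone derivation still sees a Low-class onset with mai tho.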
Thanthakhat — the killer mark ◌์¶
Thanthakhat (◌์, U+0E4C) marks a consonant as silent: it is written but not pronounced, so it is not represented in the phonological output. It is common in Sanskrit and Pali loanwords, where the orthography retains more consonants than Thai phonology permits.
Simple case: เดิน (walk) contains no thanthakhat, so nothing is removed. In ศักดิ์ (dignity, from Sanskrit), the ด carries both ◌ิ and ◌์, so the ด and its ◌ิ are both dropped from the phonological output.
Fossil clusters: When thanthakhat kills the last consonant of a Sanskrit-fossil cluster (e.g. จันทร์, พักตร์), thaiphon identifies the two-letter silent cluster (ทร, ตร, etc.) from a conservative list and kills both letters. Single-letter killing handles the general case.
The thanthakhat handling runs inside the final-consonant extraction step of the derivation pipeline, before vowel and coda are resolved.
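The two killing rules can be sketched with regular expressions. This is an illustration under stated assumptions rather than the library's implementation: `strip_killed` is a hypothetical helper, the fossil list contains only the two clusters named above (ทร, ตร), and the single-letter rule removes one consonant plus an optional vowel mark written above or below it.

```python
import re

THANTHAKHAT = "\u0e4c"  # ◌์

# Assumed "conservative list": only the clusters named in the text (ทร, ตร).
FOSSIL_CLUSTERS = ("\u0e17\u0e23", "\u0e15\u0e23")

# A listed two-letter cluster followed by thanthakhat.
_FOSSIL = re.compile("(?:" + "|".join(FOSSIL_CLUSTERS) + ")" + THANTHAKHAT)
# One consonant (ก..ฮ), an optional vowel mark (◌ิ..◌ู), then thanthakhat.
_SINGLE = re.compile("[\u0e01-\u0e2e][\u0e34-\u0e39]?" + THANTHAKHAT)


def strip_killed(word: str) -> str:
    """Remove silent letters marked by thanthakhat."""
    # Fossil clusters are tried first so that both letters are removed.
    word = _FOSSIL.sub("", word)
    # Everything else falls back to single-letter killing.
    return _SINGLE.sub("", word)
```

Running the fossil pattern first matters: if the single-letter rule ran first on จันทร์, it would remove only ร and leave a stray ท behind.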
ทร — the ambiguous digraph¶
The sequence ทร (tho thahan + ro ruea) has three possible readings depending on the word:
| Reading | Phoneme | Example |
|---|---|---|
| /s/ | s | ทราบ (to know), ทราย (sand) |
| /tʰr/ | true cluster | ทรง (to sustain), ทรัพย์ (wealth) |
| ทอ-ระ (two syllables) | tʰɔː + ra | ทะเลทราย (for some compounds) |
thaiphon looks up each word containing ทร in a dedicated lexicon and applies the correct reading. Words not in the lexicon receive the cluster reading /tʰr/ as a default.
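The lookup-with-default behaviour can be sketched as a dictionary with a fallback. The names `ThoRoReading`, `THO_RO_LEXICON`, and `tho_ro_reading` are hypothetical, and the two lexicon entries are just the /s/ examples from the table; the real lexicon and its data format are not shown in this section.

```python
from enum import Enum
from typing import Optional


class ThoRoReading(Enum):
    S = "s"                        # ทร read as /s/
    CLUSTER = "tʰr"                # true cluster
    TWO_SYLLABLES = "tʰɔː + ra"    # ทอ-ระ


THO_RO = "\u0e17\u0e23"  # ทร

# Illustrative entries only, taken from the /s/ row of the table above.
THO_RO_LEXICON = {
    "ทราบ": ThoRoReading.S,
    "ทราย": ThoRoReading.S,
}


def tho_ro_reading(word: str) -> Optional[ThoRoReading]:
    """Return the reading of ทร in `word`, or None if the word has no ทร."""
    if THO_RO not in word:
        return None
    # Words missing from the lexicon fall back to the cluster reading.
    return THO_RO_LEXICON.get(word, ThoRoReading.CLUSTER)
```

Keeping the default in one place makes the trade-off explicit: any ทร word that is missing from the lexicon gets the cluster reading, whether or not that is correct for the word.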
ฤ and ฤๅ — the obsolete vowel letters¶
ฤ (sara rue) and ฤๅ (sara rue long) are archaic vowel letters that appear in a small set of Thai words, mostly Sanskrit borrowings.
Common readings:
- ฤ → /ri/ (short), /rɯː/ (long), or /rɤː/
- ฤๅ → /rɯː/ (always long)
Like ทร, these are handled by a closed lexicon: each word containing ฤ or ฤๅ is listed with its pronunciation. The runner substitutes a pronounceable respelling (using regular Thai vowel marks) before running the rest of the derivation pipeline.
ไ / ใ — the two Sara Ai¶
Thai has two letters that both produce the same /aj/ diphthong: ไ (sara ai maimalai, U+0E44) and ใ (sara ai maimuan, U+0E43). They are orthographically distinct but phonologically identical.
thaiphon treats both as pre-vowels marking the /aj/ nucleus. The distinction is purely historical and is preserved only in the orthographic raw field.
ๆ — mai yamok (repetition mark)¶
ๆ (mai yamok, U+0E46) indicates that the preceding word should be repeated. thaiphon expands this in the normalisation phase, before any phonological processing.
After expansion the repeated form is processed normally.
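A minimal sketch of the expansion, assuming the input is a single word or is already split into whitespace-separated tokens (real Thai text has no spaces between words, so locating the "preceding word" generally needs a segmenter). `expand_mai_yamok` is a hypothetical helper, not part of thaiphon's documented API.

```python
import re

MAI_YAMOK = "\u0e46"  # ๆ

# A run of non-space characters, optional whitespace, then mai yamok.
_REPEAT = re.compile(r"([^\s\u0e46]+)\s*\u0e46")


def expand_mai_yamok(text: str) -> str:
    """Replace 'word + ๆ' with the word written twice."""
    return _REPEAT.sub(r"\1\1", text)
```

For example, เด็กๆ becomes เด็กเด็ก, and the optional whitespace in the pattern means the spaced spelling เด็ก ๆ expands to the same result.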
ฯลฯ — etc. abbreviation¶
The three-character sequence ฯลฯ (paiyannoi-lo-paiyannoi) is an abbreviation for "and so on" (analogous to "etc."). thaiphon expands it to และอื่นๆ before processing.
Thai digits¶
Single Thai digits (๐ ๑ ๒ … ๙) are expanded to their Thai word forms before phonological processing:
| Digit | Word | Pronunciation |
|---|---|---|
| ๐ | ศูนย์ | /suːn˩˩˦/ |
| ๑ | หนึ่ง | /nɯŋ˨˩/ |
| ๒ | สอง | /sɔːŋ˩˩˦/ |
| ... | ... | ... |
Multi-digit sequences are passed through without expansion (positional reading is too complex for a single rule).
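The single-digit rule and the multi-digit pass-through can be sketched with a lookaround pattern. `expand_single_digits` and `DIGIT_WORDS` are hypothetical names, and the mapping contains only the three rows shown in the table, so other digits are left unchanged in this sketch.

```python
import re

# Only the rows listed in the table above; the remaining digits are omitted.
DIGIT_WORDS = {
    "\u0e50": "ศูนย์",  # ๐
    "\u0e51": "หนึ่ง",  # ๑
    "\u0e52": "สอง",    # ๒
}

# A Thai digit (๐..๙) with no Thai digit directly before or after it.
_SINGLE_DIGIT = re.compile(
    r"(?<![\u0e50-\u0e59])[\u0e50-\u0e59](?![\u0e50-\u0e59])"
)


def expand_single_digits(text: str) -> str:
    """Expand isolated Thai digits to words; leave digit runs untouched."""
    return _SINGLE_DIGIT.sub(
        lambda m: DIGIT_WORDS.get(m.group(0), m.group(0)), text
    )
```

The lookbehind and lookahead are what keep a sequence such as ๑๒ intact: neither digit is isolated, so the pattern matches nothing and the text passes through unchanged.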