Special cases

Several orthographic constructions in Thai require special handling that goes beyond the standard syllabification and derivation rules.


Leading ห — the silent tone-shifter

When the High-class letter ห appears immediately before a Low-class sonorant (ง ญ น ม ย ร ล ว) without a vowel sign between them, it is not pronounced. Instead, it shifts the following syllable's effective class from Low to High.

This is a purely orthographic tone-shifting device. The ห contributes no phoneme to the output — only a change of class.

Examples:

| Written | Without leading ห | With leading ห | Tone shift |
|---|---|---|---|
| หนา | นา = LC, live → MID | หนา = promoted to HC, live → RISING | MID → RISING |
| หมา | มา = LC, live → MID | หมา = promoted to HC, live → RISING | MID → RISING |
| หนึ่ง | นึ่ง = LC, ◌่ → FALLING | หนึ่ง = promoted to HC, ◌่ → LOW | FALLING → LOW |

In the API:

from thaiphon import transcribe

transcribe("หนา", scheme="ipa")   # /naː˩˩˦/  — rising tone
transcribe("นา",  scheme="ipa")   # /naː˧/    — mid tone

The effective_class on the Syllable object records HIGH for syllables that received the leading-ห promotion. The onset field still records the actual pronounced consonant (not ห).
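The promotion rule can be sketched in a few lines of standalone Python. This is only an illustration of the rule described above, not thaiphon's internal code: the names `split_leading_ho`, `LIVE_TONE`, and `live_tone` are hypothetical, and the sketch assumes a live syllable that starts with either a Low-class sonorant or ห plus a Low-class sonorant (preposed vowels such as เ are not handled).

```python
from typing import Tuple

# Low-class sonorants that a silent leading ห can promote (from the rule above).
LOW_SONORANTS = set("งญนมยรลว")

MAI_EK = "\u0e48"   # ◌่
MAI_THO = "\u0e49"  # ◌้

def split_leading_ho(syllable: str) -> Tuple[str, bool]:
    """Drop a silent leading ห; return (remaining text, was_promoted)."""
    if len(syllable) >= 2 and syllable[0] == "ห" and syllable[1] in LOW_SONORANTS:
        return syllable[1:], True
    return syllable, False

# Tone of a live syllable, keyed by (effective class, tone mark).
# Hypothetical table covering only the Low and High rows used here.
LIVE_TONE = {
    ("LOW", ""): "MID",
    ("LOW", MAI_EK): "FALLING",
    ("LOW", MAI_THO): "HIGH",
    ("HIGH", ""): "RISING",
    ("HIGH", MAI_EK): "LOW",
    ("HIGH", MAI_THO): "FALLING",
}

def live_tone(syllable: str) -> str:
    """Tone of a live syllable whose pronounced onset is a Low-class sonorant."""
    rest, promoted = split_leading_ho(syllable)
    mark = next((ch for ch in rest if ch in (MAI_EK, MAI_THO)), "")
    effective_class = "HIGH" if promoted else "LOW"
    return LIVE_TONE[(effective_class, mark)]
```

The onset stays the sonorant in both cases; only the class used for the tone lookup changes.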


Sara Am — decomposition of ◌ำ

Sara Am (◌ำ, Unicode U+0E33) looks like a single vowel mark but phonemically decomposes into a long /aː/ vowel plus a nasal /m/ coda:

◌ำ  →  long /aː/ + coda /m/

This decomposition happens during input expansion, before syllabification. The word น้ำ (water) contains:

  • น — onset /n/, LC sonorant
  • ้ — mai tho tone mark
  • ำ — decomposes to long /aː/ + /m/ coda

Result: onset /n/ + vowel /aː/ LONG + coda /m/ + LC class + mai tho → HIGH tone. IPA: /naːm˦˥/.
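The breakdown of น้ำ above can be reproduced with a small standalone sketch. The `Parts` dataclass and `decompose_sara_am` function are hypothetical names used for illustration and are not part of thaiphon's API; the sketch assumes a single syllable with a one-letter onset that ends in the precomposed ◌ำ (U+0E33).

```python
from dataclasses import dataclass
from typing import Optional

SARA_AM = "\u0e33"  # ◌ำ

# Tone marks that may sit between the onset and ◌ำ.
TONE_MARKS = {
    "\u0e48": "mai ek",
    "\u0e49": "mai tho",
    "\u0e4a": "mai tri",
    "\u0e4b": "mai chattawa",
}

@dataclass
class Parts:
    onset: str                # written onset consonant
    tone_mark: Optional[str]  # name of the tone mark, if any
    vowel: str                # nucleus after decomposition
    coda: str                 # coda after decomposition

def decompose_sara_am(syllable: str) -> Parts:
    """Split a syllable ending in ◌ำ into onset, tone mark, /aː/, and /m/."""
    if not syllable.endswith(SARA_AM):
        raise ValueError("expected a syllable ending in sara am")
    body = syllable[:-1]
    mark = next((TONE_MARKS[ch] for ch in body if ch in TONE_MARKS), None)
    onset = "".join(ch for ch in body if ch not in TONE_MARKS)
    # Vowel length follows the description above (long /aː/ plus /m/ coda).
    return Parts(onset=onset, tone_mark=mark, vowel="aː", coda="m")
```

For น้ำ this yields onset น, mai tho, vowel /aː/, and coda /m/, which is the input the tone rules then work from.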


Thanthakhat — the killer mark ◌์

Thanthakhat (◌์, U+0E4C) marks a consonant as silent: the letter is written but not pronounced, so it is absent from the phonological output. It is common in Sanskrit and Pali loanwords, where the orthography retains more consonants than Thai phonology permits.

Simple case: เดิน (to walk) has no thanthakhat and is left unchanged. By contrast, ศักดิ์ (dignity, from Sanskrit) carries both ◌ิ and ◌์ on ด, so the ด and its ◌ิ are both dropped from the phonological output.

Fossil clusters: When thanthakhat kills the last consonant of a Sanskrit-fossil cluster (e.g. จันทร์, พักตร์), thaiphon identifies the two-letter silent cluster (ทร, ตร, etc.) from a conservative list and kills both letters. Single-letter killing handles the general case.

The thanthakhat handling runs inside the final-consonant extraction step of the derivation pipeline, before vowel and coda are resolved.
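A minimal sketch of the two killing passes, written as plain string processing. It is not thaiphon's implementation: `strip_killed` is a hypothetical name, and `FOSSIL_CLUSTERS` lists only the two clusters named above rather than the library's actual conservative list.

```python
import re

THANTHAKHAT = "\u0e4c"  # ◌์

# Illustrative subset of silent two-letter clusters (assumption, not the real list).
FOSSIL_CLUSTERS = ("ทร", "ตร")

# One consonant (ก..ฮ), any above/below vowel signs on it, then thanthakhat.
_SINGLE_KILL = re.compile("[\u0e01-\u0e2e][\u0e31\u0e34-\u0e3a]*\u0e4c")

def strip_killed(word: str) -> str:
    """Remove letters silenced by thanthakhat from a written word."""
    # Pass 1: a listed fossil cluster followed by thanthakhat loses both letters.
    for cluster in FOSSIL_CLUSTERS:
        word = word.replace(cluster + THANTHAKHAT, "")
    # Pass 2: the general case kills a single consonant and its vowel sign.
    return _SINGLE_KILL.sub("", word)
```

Running the cluster pass first matters: applying only the single-letter rule to จันทร์ would remove ร alone and leave a stray ท behind.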


ทร — the ambiguous digraph

The sequence ทร (tho thahan + ro rua) has three possible readings depending on the word:

| Reading | Phoneme | Example |
|---|---|---|
| /s/ | s | ทราบ (to know), ทราย (sand), ทรง (to sustain), ทรัพย์ (wealth) |
| /tʰr/ | true cluster | จันทรา (moon), นิทรา (sleep) |
| ทอ-ระ (two syllables) | tʰɔː + ra | ทรมาน (to torture), ทรยศ (to betray) |

thaiphon looks up each word containing ทร in a dedicated lexicon and applies the correct reading. Words not in the lexicon receive the cluster reading /tʰr/ as a default.
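The lookup-with-default behaviour can be sketched as a dictionary with a fallback. The names `ThoRoReading`, `THO_RO_LEXICON`, and `tho_ro_reading` are hypothetical, and the three entries are illustrative only; thaiphon's real lexicon and its data format are not shown here.

```python
from enum import Enum
from typing import Optional

class ThoRoReading(Enum):
    S = "s"                     # ทร read as /s/
    CLUSTER = "tʰr"             # true cluster
    TWO_SYLLABLES = "tʰɔː.ra"   # ทอ-ระ

# Tiny illustrative lexicon (assumption): word -> reading of its ทร.
THO_RO_LEXICON = {
    "ทราบ": ThoRoReading.S,
    "ทราย": ThoRoReading.S,
    "ทรมาน": ThoRoReading.TWO_SYLLABLES,
}

def tho_ro_reading(word: str) -> Optional[ThoRoReading]:
    """Return the reading of ทร in `word`, or None if the word has no ทร."""
    if "ทร" not in word:
        return None
    # Unlisted words fall back to the cluster reading, as described above.
    return THO_RO_LEXICON.get(word, ThoRoReading.CLUSTER)
```

Because the default is the cluster reading, every /s/ or two-syllable word has to be listed explicitly; a missing entry shows up as a /tʰr/ transcription rather than an error.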


ฤ and ฤๅ — the obsolete vowel letters

ฤ (sara rue) and ฤๅ (sara rue long) are archaic vowel letters that appear in a small set of Thai words, mostly Sanskrit borrowings.

Common readings:

  • ฤ → /ri/ or /rɯ/ (both short), or /rɤː/ (long)
  • ฤๅ → /rɯː/ (always long)

Like ทร, these are handled by a closed lexicon: each word containing ฤ or ฤๅ is listed with its pronunciation. The runner substitutes a pronounceable respelling (using regular Thai vowel marks) before running the rest of the derivation pipeline.
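The respelling step might look roughly like the sketch below. The table `RUE_RESPELLINGS`, the function `respell_rue`, and the choice to raise `KeyError` for unlisted words are all assumptions made for illustration; the respellings follow ordinary dictionary-style pronunciation spellings and are not taken from thaiphon's lexicon.

```python
# Illustrative closed lexicon (assumption): written word -> pronounceable respelling.
RUE_RESPELLINGS = {
    "อังกฤษ": "อังกริด",  # ฤ read as /ri/
    "ฤกษ์": "เริก",        # ฤ read as /rɤː/
    "ฤๅษี": "รือสี",       # ฤๅ read as /rɯː/
}

def respell_rue(word: str) -> str:
    """Replace a word containing ฤ or ฤๅ with its listed respelling.

    Words without ฤ pass through unchanged. Unlisted words containing ฤ
    raise KeyError in this sketch, since a closed lexicon has no rule to
    fall back on.
    """
    if "ฤ" not in word:
        return word
    return RUE_RESPELLINGS[word]
```

The respelled form contains only regular vowel marks, so the rest of the derivation pipeline needs no special knowledge of ฤ.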


ไ / ใ — the two Sara Ai

Thai has two letters that both produce the same /aj/ diphthong: ไ (sara ai maimalai, U+0E44) and ใ (sara ai maimuan, U+0E43). They are orthographically distinct but phonologically identical.

thaiphon treats both as pre-vowels marking the /aj/ nucleus. The distinction is purely historical and is preserved only in the orthographic raw field.


ๆ — mai yamok (repetition mark)

ๆ (mai yamok, U+0E46) indicates that the preceding word should be repeated. thaiphon expands this in the normalisation phase before any phonological processing:

ต้นไม้ๆ  →  ต้นไม้ต้นไม้

After expansion the repeated form is processed normally.
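A rough sketch of the expansion as a regular-expression substitution. The function name `expand_mai_yamok` is hypothetical, and the sketch makes a simplifying assumption: "the preceding word" is taken to be the whitespace-delimited run directly before ๆ, which only works when the input is already separated into words.

```python
import re

# A run of non-space characters (excluding ๆ itself), optional spaces, then ๆ (U+0E46).
_YAMOK = re.compile(r"([^\s\u0e46]+)\s*\u0e46")

def expand_mai_yamok(text: str) -> str:
    """Replace 'word + ๆ' with the word written twice."""
    return _YAMOK.sub(r"\1\1", text)
```

Both the unspaced form ต้นไม้ๆ and the spaced form เด็ก ๆ are handled, because the optional whitespace before ๆ is consumed by the pattern.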


ฯลฯ — etc. abbreviation

The three-character sequence ฯลฯ (paiyannoi, lo ling, paiyannoi; known together as paiyan yai) is an abbreviation for "and so on" (analogous to "etc."). thaiphon expands it to และอื่นๆ before processing.


Thai digits

Single Thai digits (๐ ๑ ๒ … ๙) are expanded to their Thai word forms before phonological processing:

| Digit | Word | Pronunciation |
|---|---|---|
| ๐ | ศูนย์ | /suːn˩˩˦/ |
| ๑ | หนึ่ง | /nɯŋ˨˩/ |
| ๒ | สอง | /sɔːŋ˩˩˦/ |
| ... | ... | ... |

Multi-digit sequences are passed through without expansion (positional reading is too complex for a single rule).
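The single-digit rule and the multi-digit pass-through can be sketched with one regular expression that uses lookarounds to match only isolated digits. `DIGIT_WORDS` and `expand_single_digits` are hypothetical names for illustration, not thaiphon's API.

```python
import re

# Thai digits ๐..๙ and their word forms.
DIGIT_WORDS = {
    "๐": "ศูนย์", "๑": "หนึ่ง", "๒": "สอง", "๓": "สาม", "๔": "สี่",
    "๕": "ห้า", "๖": "หก", "๗": "เจ็ด", "๘": "แปด", "๙": "เก้า",
}

# A Thai digit with no Thai digit immediately before or after it.
_SINGLE_DIGIT = re.compile(r"(?<![๐-๙])[๐-๙](?![๐-๙])")

def expand_single_digits(text: str) -> str:
    """Spell out isolated Thai digits; leave multi-digit runs untouched."""
    return _SINGLE_DIGIT.sub(lambda m: DIGIT_WORDS[m.group()], text)
```

The lookbehind and lookahead keep a run such as ๑๒ intact, so no digit inside a longer number is read out on its own.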