What thaiphon does

thaiphon reads Thai script and produces a pronunciation guide.

Thai uses an alphabetic writing system, but the relationship between spelling and sound is complex. A single Thai consonant letter can have two or more pronunciations depending on where it appears in a syllable. There are five distinct tones, and no single letter marks tone directly; instead, tone is calculated from the consonant class, the vowel length, the syllable shape, and any tone mark that may be present. Vowels are written above, below, before, and after the consonant they belong to. Some written consonants are silent markers that only shift the tone of the syllable they begin.
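The tone calculation described above can be sketched with the standard textbook rule table. This is a minimal illustration and not thaiphon's own code: the function name thai_tone, its parameters, and the string labels are assumptions made for this example.

```python
from typing import Optional

CLASSES = {"low", "mid", "high"}


def thai_tone(consonant_class: str, live: bool, long_vowel: bool,
              tone_mark: Optional[str] = None) -> str:
    """Return the tone name for one syllable (hypothetical helper).

    consonant_class: class of the onset consonant ("low", "mid", "high").
    live: True if the syllable ends in a long vowel or a sonorant coda,
          False for a "dead" syllable (short open vowel or stop coda).
    long_vowel: only consulted for unmarked dead syllables with a low-class onset.
    tone_mark: None, "mai_ek", "mai_tho", "mai_tri", or "mai_chattawa".
    """
    if consonant_class not in CLASSES:
        raise ValueError(f"unknown consonant class: {consonant_class!r}")

    # A written tone mark overrides the syllable-shape rules.
    if tone_mark == "mai_ek":
        return "falling" if consonant_class == "low" else "low"
    if tone_mark == "mai_tho":
        return "high" if consonant_class == "low" else "falling"
    if tone_mark == "mai_tri":        # normally written only on mid-class onsets
        return "high"
    if tone_mark == "mai_chattawa":   # normally written only on mid-class onsets
        return "rising"
    if tone_mark is not None:
        raise ValueError(f"unknown tone mark: {tone_mark!r}")

    # No tone mark: the tone follows from class, live/dead shape, and length.
    if live:
        return "rising" if consonant_class == "high" else "mid"
    if consonant_class == "low":
        return "falling" if long_vowel else "high"
    return "low"


# Low-class onset, live syllable, mai tho
print(thai_tone("low", live=True, long_vowel=True, tone_mark="mai_tho"))  # high
```

The lookup is deliberately flat: each branch corresponds to one row of the usual classroom tone chart, which makes it easy to check against a textbook.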

thaiphon works through all of these rules systematically. Given a string of Thai characters, it produces:

  • A phonetic representation in one of several notations — IPA symbols, plain-ASCII Latin, Cyrillic, or learner-focused romanizations like RTL and Paiboon.
  • Tone information embedded in the notation, in the convention of each scheme (Chao tone letters, bracketed tags, combining diacritics on vowel letters, or spacing modifier characters).
  • Vowel length — Thai distinguishes short and long vowels phonemically, and thaiphon encodes this distinction in every scheme.
  • Optionally, the raw phonological structure as a Python object, so you can inspect onset, vowel, coda, and tone for each syllable individually.

A concrete example

Take the word น้ำ (water).

In writing, it consists of three characters: น (no nu), ◌้ (the mai tho tone mark), and ำ (sara am). thaiphon parses these as:

  • Onset: /n/ — the consonant น in onset position, belonging to the low consonant class.
  • Sara Am (ำ): a compound vowel marker that decomposes into a long /aː/ vowel plus a nasal /m/ coda.
  • Tone mark: mai tho (◌้), which on a low-class onset produces the high tone.
  • Vowel length: long.
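The three written characters in this word can be checked with Python's standard unicodedata module. This snippet only inspects the Unicode input and does not call thaiphon; the word is spelled with escapes so the code points are unambiguous.

```python
import unicodedata

# The word "water": NO NU, MAI THO, SARA AM
word = "\u0e19\u0e49\u0e33"

# Look up the official Unicode name of each code point.
names = [unicodedata.name(ch) for ch in word]

for ch, name in zip(word, names):
    print(f"U+{ord(ch):04X}  {name}")
```

Listing code points this way is a quick sanity check that the input really contains the consonant, tone mark, and vowel sign in the expected order before any phonological analysis is attempted.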

The result in each scheme:

Scheme | Output (html mode) | Meaning
ipa | /naːm˦˥/ | /n/ onset, long /aː/, /m/ coda, high tone (Chao ˦˥)
tlc | naam<sup>H</sup> | same in Latin, tone as superscript
morev | на̄мˇ | Cyrillic, macron marks long vowel, ˇ after coda marks high tone

All three come from the same internal phonological word; only the surface rendering differs. Every built-in scheme draws on that one analysis.
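The idea of one analysis feeding several renderings can be sketched as follows. Everything here is hypothetical: the Syllable class and the render_ipa and render_tlc_html functions are illustrative names, not thaiphon's API. The pairing of tone names with Chao letters follows common usage (only the high tone is confirmed by the example above), and doubling the vowel letter for length in the tlc sketch is an assumption based on that single example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Syllable:
    """Hypothetical scheme-independent analysis of one syllable."""
    onset: str
    vowel: str
    long: bool
    coda: str
    tone: str  # "mid", "low", "falling", "high", or "rising"


# Assumed tone-name to Chao-letter pairing for the IPA rendering.
CHAO = {"mid": "˧", "low": "˨˩", "falling": "˥˩", "high": "˦˥", "rising": "˩˩˦"}
# Tone tags of the tlc scheme.
TLC_TAG = {"mid": "M", "low": "L", "falling": "F", "high": "H", "rising": "R"}


def render_ipa(s: Syllable) -> str:
    # Length is marked with the IPA length sign after the vowel.
    vowel = s.vowel + ("ː" if s.long else "")
    return f"/{s.onset}{vowel}{s.coda}{CHAO[s.tone]}/"


def render_tlc_html(s: Syllable) -> str:
    # Assumption: length is shown by doubling the vowel letter.
    vowel = s.vowel * 2 if s.long else s.vowel
    return f"{s.onset}{vowel}{s.coda}<sup>{TLC_TAG[s.tone]}</sup>"


# One analysis, two surface forms.
nam = Syllable(onset="n", vowel="a", long=True, coda="m", tone="high")
print(render_ipa(nam))       # /naːm˦˥/
print(render_tlc_html(nam))  # naam<sup>H</sup>
```

Keeping the renderers as pure functions of the same immutable object is one simple way to guarantee that schemes can never disagree about the underlying sounds.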

Want to try before installing?

The online tool at rianthai.pro/thai-transliteration runs this same engine in your browser. Paste a word or sentence and compare output across schemes — no Python needed.

What thaiphon does not do

  • Speech synthesis. thaiphon produces text, not audio.
  • Word-sense disambiguation. In the small set of cases where a Thai word has multiple pronunciations depending on meaning or part of speech, thaiphon takes the most common reading.
  • Full sentence prosody. Sandhi and connected-speech effects across word boundaries are not modelled.
  • Handwriting or image recognition. thaiphon takes Unicode text as input.

The eight built-in schemes

ipa — International Phonetic Alphabet, the linguist's notation. Syllables are separated by . and wrapped in /…/. Tones use Chao tone letters (˧ ˨˩ ˥˩ ˦˥ ˩˩˦). This is the scheme used for accuracy benchmarking against Wiktionary.

tlc — The "Enhanced Phonemic" notation used by thai-language.com. Written in plain ASCII Latin letters. Syllables are separated by spaces. Tones are tagged at the end: {M} mid, {L} low, {H} high, {F} falling, {R} rising. With format="html" the tags become <sup> elements.
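The relationship between the brace tags and the HTML form can be illustrated with a small post-processing step. This is only a sketch of the described behaviour, not how thaiphon implements format="html"; the helper name tlc_text_to_html and the sample string are assumptions.

```python
import re

# Matches the five tone tags described above: {M} {L} {H} {F} {R}
TONE_TAG = re.compile(r"\{([MLHFR])\}")


def tlc_text_to_html(text: str) -> str:
    """Rewrite brace tone tags as <sup> elements (hypothetical helper)."""
    return TONE_TAG.sub(r"<sup>\1</sup>", text)


print(tlc_text_to_html("naam{H}"))  # naam<sup>H</sup>
```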

morev — Cyrillic transliteration following the Morev/Plam/Fomicheva 1964 dictionary, used in Russian-language Thai teaching materials. Aspirated stops are written as digraphs (кх, тх, пх; ч bare for the aspirated palatal). With format="html" the second element of each aspiration digraph becomes a <sup> element. Tones are spacing modifier letters placed at the end of the syllable, after any coda; mid tone is unmarked.

lmt — Cyrillic transliteration from the Lipilina-Muzychenko-Thapanosoth 2018 MSU/ISAA learner textbook. Shares the same onset and coda Cyrillic letters as morev. Distinctive features: vowel length marked with an ASCII colon (а: for long /a/), tone shown as a Unicode superscript digit at the end of the syllable (⁰ mid, ¹ low, ² falling, ³ high, ⁴ rising), and a space as the syllable separator.

rtgs — The official Royal Thai General System of Transcription (2002 revision). Plain ASCII Latin output, no tone marks, no vowel length distinction. Used on road signs, in government publications, and for transliterated place names. Foreign-origin coda phonemes always collapse to the native Thai inventory.
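Collapsing a foreign coda to the native inventory can be sketched as a lookup. The mapping below reflects the usual pattern in Thai loanword pronunciation (/f/ to /p/, /s/ to /t/, /l/ to /n/) and is an assumption for illustration; it is not taken from thaiphon's source, and the name collapse_coda is hypothetical.

```python
# Native Thai coda phonemes (IPA); the empty string stands for an open syllable.
NATIVE_CODAS = {"", "p", "t", "k", "m", "n", "ŋ", "j", "w"}

# Assumed nativization of common foreign codas.
FOREIGN_TO_NATIVE = {"f": "p", "s": "t", "l": "n"}


def collapse_coda(coda: str) -> str:
    """Map a coda phoneme onto the native Thai coda inventory."""
    if coda in NATIVE_CODAS:
        return coda
    if coda in FOREIGN_TO_NATIVE:
        return FOREIGN_TO_NATIVE[coda]
    raise ValueError(f"no native equivalent known for coda {coda!r}")


for coda in ["s", "f", "l", "m"]:
    print(repr(coda), "->", repr(collapse_coda(coda)))
```

Raising on unknown codas, rather than passing them through, keeps a scheme like this from silently emitting sounds outside the inventory it promises.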

rtl — The romanization used in Rak Thai Language School course materials. Aspirated stops are digraphs (ph th kh ch); /tɕ/ is written c. IPA vowel letters ʉ ɛ ɔ ə are doubled for length. Tone is a combining diacritic on the first vowel letter; mid tone carries a macron. Vowel-initial syllables receive a ʼ (U+02BC) onset. Syllable separator is a space.

paiboon — The original Paiboon Publishing romanization from the first-edition Thai learner series. Aspirated stops use bare letters (p t k ch); unaspirated voiceless stops are written bp, dt, g, and j, borrowing English letter values. Mid tone is unmarked. Centring diphthongs are spelled the same at both lengths (ia ʉa ua). Syllable separator is a hyphen.

paiboon_plus — The revised Paiboon system from the 2009 Three-Way dictionary. Identical to paiboon except long centring diphthongs are doubled in spelling (iia ʉʉa uua) to distinguish them from the short forms.
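The single difference between the two Paiboon variants can be sketched as a function. The spellings come from the descriptions above; the function name spell_centring_diphthong and its signature are hypothetical and not part of thaiphon.

```python
CENTRING_DIPHTHONGS = ("ia", "ʉa", "ua")
PAIBOON_SCHEMES = ("paiboon", "paiboon_plus")


def spell_centring_diphthong(diphthong: str, long: bool, scheme: str) -> str:
    """Spell a centring diphthong in paiboon or paiboon_plus."""
    if scheme not in PAIBOON_SCHEMES:
        raise ValueError(f"unsupported scheme: {scheme!r}")
    if diphthong not in CENTRING_DIPHTHONGS:
        raise ValueError(f"not a centring diphthong: {diphthong!r}")
    if scheme == "paiboon_plus" and long:
        # Double the first vowel letter: ia -> iia, ʉa -> ʉʉa, ua -> uua
        return diphthong[0] + diphthong
    # paiboon spells both lengths alike; paiboon_plus keeps short forms as-is.
    return diphthong


for d in CENTRING_DIPHTHONGS:
    print(d,
          spell_centring_diphthong(d, long=True, scheme="paiboon"),
          spell_centring_diphthong(d, long=True, scheme="paiboon_plus"))
```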

Reading profiles

Thai has register variation — the same word may be pronounced differently in casual speech, formal broadcast, or scholarly/learned recitation. thaiphon models this through four named profiles:

  • everyday (default) — colloquial urban speech.
  • careful_educated — formal broadcast register, retains more foreign codas.
  • learned_full — restores full Sanskrit/Pali learned readings for Indic-derived words.
  • etalon_compat — collapses every foreign coda to its native-Thai equivalent, matching dictionary-citation style.

See Reading profiles for details and examples.