Your first transcription

This page walks through the most common thaiphon operations and explains what each piece of output means.

Before you start

Make sure thaiphon is installed. If it is not, see Install or Install without Python experience.


The transcribe function

transcribe is the main entry point. It takes a Thai string and returns its transcription in the chosen scheme.

from thaiphon import transcribe

transcribe("สวัสดี", format="html")
# 'sa<sup>L</sup> wat<sup>L</sup> dee<sup>M</sup>'

The default scheme is tlc (thai-language.com Enhanced Phonemic). Pass format="html" to get tone tags as <sup> elements, the form shown throughout these docs. The default, format="text", returns plain text with the tone tags in curly braces ({L}, {M}, etc.).
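
If you have stored plain-text output and later need the HTML form, the two differ only in how the tone tag is wrapped. The sketch below is not part of thaiphon: text_tags_to_html is a hypothetical helper, and the sample input is an assumption inferred from the HTML output above (same syllables, with curly-brace tags in place of the <sup> elements).

```python
import re

# Hypothetical helper, not part of thaiphon.
# Assumes each tone tag is one letter from M, L, H, F, R inside curly braces.
_TONE_TAG = re.compile(r"\{([MLHFR])\}")


def text_tags_to_html(text: str) -> str:
    """Rewrite {X} tone tags as <sup>X</sup> elements."""
    return _TONE_TAG.sub(r"<sup>\1</sup>", text)


# Assumed plain-text form of the example above.
print(text_tags_to_html("sa{L} wat{L} dee{M}"))
# sa<sup>L</sup> wat<sup>L</sup> dee<sup>M</sup>
```

In practice, calling transcribe with format="html" directly is simpler; the helper only illustrates how the two forms relate.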

Choosing a scheme

# International Phonetic Alphabet
transcribe("สวัสดี", scheme="ipa")
# '/sa˨˩.wat̚˨˩.diː˧/'

# Cyrillic (Morev tradition) — format="html" gives superscript aspiration marks
transcribe("สวัสดี", scheme="morev", format="html")
# 'саˆ-ватˆ-дӣ'

# thai-language.com notation with superscript tones (html mode)
transcribe("สวัสดี", scheme="tlc", format="html")
# 'sa<sup>L</sup> wat<sup>L</sup> dee<sup>M</sup>'

See Schemes for a full comparison.

Sentence-level input

Use transcribe_sentence when your input contains pre-segmented words (separated by spaces or punctuation). It transcribes each token and joins results with spaces:

from thaiphon import transcribe_sentence, transcribe_word

# Transcribe individual words — most reliable for single known words.
transcribe_word("ฉัน", scheme="ipa")    # '/tɕʰan˩˩˦/'
transcribe_word("ชอบ", scheme="ipa")    # '/tɕʰɔːp̚˥˩/'
transcribe_word("กิน", scheme="ipa")    # '/kin˧/'
transcribe_word("ข้าว", scheme="ipa")   # '/kʰaːw˥˩/'

Word segmentation

transcribe_sentence uses a dictionary-based longest-match segmenter. Results depend on which words are in the built-in dictionary. For best results with sentences, pre-segment your text and pass individual words to transcribe_word, or install pythainlp for improved automatic segmentation.
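
To see why results depend on the dictionary, here is a minimal sketch of greedy longest-match segmentation. It illustrates the general technique only and is not thaiphon's implementation; longest_match_segment and toy_dictionary are hypothetical names, and the fallback for unknown text (emit one character at a time) is an assumption made for this example.

```python
def longest_match_segment(text: str, dictionary: set[str]) -> list[str]:
    """Greedy left-to-right segmentation: take the longest dictionary word at each position."""
    max_len = max((len(word) for word in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a dictionary word matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in dictionary:
                break
        else:
            # No dictionary word starts here: emit a single character and move on.
            candidate = text[i]
        tokens.append(candidate)
        i += len(candidate)
    return tokens


# Toy dictionary for illustration only.
toy_dictionary = {"ฉัน", "ชอบ", "กิน", "ข้าว"}
print(longest_match_segment("ฉันชอบกินข้าว", toy_dictionary))
# ['ฉัน', 'ชอบ', 'กิน', 'ข้าว']
```

If ข้าว is removed from the toy dictionary, the last word falls apart into single characters, which is the kind of failure that pre-segmenting your text avoids.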

For single known words, transcribe_word is equivalent to transcribe:

from thaiphon import transcribe_word

transcribe_word("น้ำ", scheme="ipa")
# '/naːm˦˥/'

Reading the IPA output

If you choose scheme ipa, the output uses standard IPA conventions:

Symbol Meaning
/…/ phonemic slashes wrapping the whole word
. syllable boundary
ː long vowel (e.g. aː = long /a/)
◌̚ unreleased stop coda (e.g. the t̚ in wat̚)
˧ mid tone
˨˩ low tone
˥˩ falling tone
˦˥ high tone
˩˩˦ rising tone

Example: /naːm˦˥/ = onset /n/, long /aː/ vowel, /m/ coda, high tone.
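
The conventions above are regular enough to take apart with a few lines of string handling. The following sketch is not part of thaiphon: split_ipa and TONES are hypothetical names, and the code assumes every syllable ends in one of the tone-letter sequences listed in the table.

```python
import re

# Tone-letter sequences from the table above, mapped to tone names.
TONES = {
    "˧": "mid",
    "˨˩": "low",
    "˥˩": "falling",
    "˦˥": "high",
    "˩˩˦": "rising",
}

# Tone letters occupy U+02E5..U+02E9; each syllable ends with one or more of them.
_SYLLABLE = re.compile(r"^(.*?)([\u02e5-\u02e9]+)$")


def split_ipa(ipa: str) -> list[tuple[str, str]]:
    """Return (segments, tone name) pairs for an IPA string such as '/naːm˦˥/'."""
    pairs = []
    for syllable in ipa.strip("/").split("."):
        match = _SYLLABLE.match(syllable)
        if match is None:
            raise ValueError(f"no tone letters found in {syllable!r}")
        segments, tone_letters = match.groups()
        pairs.append((segments, TONES.get(tone_letters, "unknown")))
    return pairs


print(split_ipa("/sa˨˩.wat̚˨˩.diː˧/"))
# [('sa', 'low'), ('wat̚', 'low'), ('diː', 'mid')]
```

For structured access to onset, vowel, and coda, the analyze function described later on this page is the better tool; string parsing like this is only a quick way to read tones off existing output.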


Reading the TLC output

The tlc scheme uses plain ASCII letters and is readable without special fonts:

Element Notation
Long vowels doubled letter: aa, ee, uu
Aspirated stops kh, th, ph, ch
Unaspirated stops g (k), dt (t), bp (p), j (tɕ)
Tones {M} mid, {L} low, {H} high, {F} falling, {R} rising

Example: naam{H} = /n/ onset, long /aa/, /m/ coda, high tone.
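
The plain-text tlc form can be read the same way. This sketch is not part of thaiphon: split_tlc and TONE_NAMES are hypothetical names, and it assumes syllables are separated by whitespace and each ends in a single curly-brace tone tag, as in the example above.

```python
import re

# Tone tags from the table above, mapped to tone names.
TONE_NAMES = {"M": "mid", "L": "low", "H": "high", "F": "falling", "R": "rising"}

# One syllable: letters followed by a single {X} tone tag.
_TLC_SYLLABLE = re.compile(r"^(.+)\{([MLHFR])\}$")


def split_tlc(text: str) -> list[tuple[str, str]]:
    """Return (letters, tone name) pairs for plain-text tlc output such as 'naam{H}'."""
    pairs = []
    for syllable in text.split():
        match = _TLC_SYLLABLE.match(syllable)
        if match is None:
            raise ValueError(f"no tone tag found in {syllable!r}")
        letters, tag = match.groups()
        pairs.append((letters, TONE_NAMES[tag]))
    return pairs


print(split_tlc("naam{H}"))
# [('naam', 'high')]
```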


A sample word list

from thaiphon import transcribe

words = {
    "สวัสดี": "hello",
    "น้ำ":    "water",
    "ข้าว":   "rice",
    "รัก":    "love",
    "ปลา":    "fish",
    "ภาษาไทย": "Thai language",
    "กรุงเทพ": "Bangkok",
    "ผลไม้":  "fruit",
}

for thai, gloss in words.items():
    ipa = transcribe(thai, scheme="ipa")
    tlc = transcribe(thai, scheme="tlc", format="html")
    print(f"{thai:12} ({gloss:15}) IPA: {ipa:30} TLC: {tlc}")

Accessing the phonological structure

If you need more than a string — for instance, to know the tone of each syllable individually — use analyze:

from thaiphon import analyze

result = analyze("ผลไม้")

for syl in result.best.syllables:
    print(
        f"onset={syl.onset.symbol if syl.onset else '∅':6} "
        f"vowel={syl.vowel.symbol:4} "
        f"length={syl.vowel_length.name:6} "
        f"coda={syl.coda.symbol if syl.coda else '∅':4} "
        f"tone={syl.tone.name}"
    )

Output (ผลไม้ has three syllables: ผล = pʰ+ɔ+n, ล = l+a, ไม้ = m+aː+j):

onset=pʰ     vowel=ɔ    length=SHORT  coda=n    tone=RISING
onset=l      vowel=a    length=SHORT  coda=∅    tone=HIGH
onset=m      vowel=a    length=LONG   coda=j    tone=HIGH

See analyze for the full API documentation.


Next steps