transcribe

transcribe, transcribe_word, transcribe_sentence

The three transcription functions convert Thai text to a romanized or phonetic string.


transcribe

def transcribe(
    text: str,
    scheme: str = "tlc",
    *,
    format: Literal["text", "html"] = "text",
    profile: ReadingProfile = "everyday",
) -> str:

Transcribe a Thai word or short phrase into the target scheme.

Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| text | str | | Thai text to transcribe. NFC normalisation is applied automatically. |
| scheme | str | "tlc" | The romanization scheme to use. Must be a registered scheme name. |
| format | "text" \| "html" | "text" | Output format. "text" returns a plain string. "html" activates per-scheme HTML rendering (e.g. superscript tone tags for TLC, superscript aspiration markup for Morev). Schemes without HTML-specific output return the same string as "text". |
| profile | ReadingProfile | "everyday" | Reading profile. Controls register-sensitive pronunciation decisions. |

Returns

str — the transcribed text in the requested scheme and format.

Raises

  • UnsupportedSchemeError — if scheme is not registered.
  • ValueError — if profile is not one of the four valid profile strings.

Examples

from thaiphon import transcribe

# Default scheme is TLC; use format="html" for superscript tone tags.
transcribe("น้ำ", format="html")
# 'naam<sup>H</sup>'

# IPA scheme: no HTML-specific output, so format has no effect.
transcribe("น้ำ", scheme="ipa")
# '/naːm˦˥/'

# Morev scheme — format="html" emits superscript aspiration markup.
transcribe("ขอ", scheme="morev", format="html")
# 'к<sup>х</sup>о̄´'

# Morev without html: aspiration, where present, is written as a plain digraph (кх / тх / пх). น้ำ has no aspirated consonant, so no digraph appears here.
transcribe("น้ำ", scheme="morev")
# 'на̄мˇ'

# TLC text mode: tone tags in curly braces instead of superscripts.
transcribe("น้ำ", scheme="tlc")
# 'naam{H}'

# Reading profile.
transcribe("ลิฟต์", scheme="ipa", profile="everyday")
# '/lif˦˥/'

transcribe("ลิฟต์", scheme="ipa", profile="etalon_compat")
# '/lip̚˦˥/'

# Empty input returns empty string.
transcribe("", scheme="tlc")
# ''

Notes

  • transcribe calls analyze internally. For multiple transcriptions of the same word in different schemes, it is more efficient to call analyze once and render with each scheme's renderer.render_word.
  • The default scheme is "tlc". To check which schemes are available, call list_schemes().
  • NFC normalisation ensures NFD and NFC input produce identical output.
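
The normalisation note can be illustrated with Python's standard unicodedata module alone. The sketch below does not call thaiphon; it only shows what NFC does to a Thai syllable whose combining marks were typed in a non-canonical order. The word ปู่ and the helper name to_nfc are illustrative choices, not part of the library.

```python
import unicodedata


def to_nfc(text: str) -> str:
    """Return the NFC form of text (hypothetical helper, not a thaiphon function)."""
    return unicodedata.normalize("NFC", text)


# ปู่ with the tone mark (U+0E48) typed before the vowel sign (U+0E39)...
typed = "\u0e1b\u0e48\u0e39"
# ...and with the marks in canonical order (vowel sign first, then tone mark).
canonical = "\u0e1b\u0e39\u0e48"

print(typed == canonical)                  # False: different code point sequences
print(to_nfc(typed) == to_nfc(canonical))  # True: NFC reorders the combining marks
```

Because the documented behaviour is to normalise input before analysis, both spellings would be expected to reach the transcriber as the same string.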

transcribe_word

def transcribe_word(
    text: str,
    scheme: str = "tlc",
    *,
    format: Literal["text", "html"] = "text",
    profile: ReadingProfile = "everyday",
) -> str:

Identical to transcribe. Provided as an explicit alternative when the caller wants to signal that the input is a single known word (rather than a possibly multi-word phrase).

Example

from thaiphon import transcribe_word

transcribe_word("สวัสดี", scheme="ipa")
# '/sa˨˩.wat̚˨˩.diː˧/'

transcribe_sentence

def transcribe_sentence(
    text: str,
    scheme: str = "tlc",
    *,
    format: Literal["text", "html"] = "text",
    profile: ReadingProfile = "everyday",
    segmenter: Callable[[str], Sequence[str]] | None = None,
) -> str:

Segment text into words, transcribe each word, and join the results with spaces.

Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| text | str | | Full sentence or multi-word string. |
| scheme | str | "tlc" | Romanization scheme. |
| format | "text" \| "html" | "text" | Output format. |
| profile | ReadingProfile | "everyday" | Reading profile. |
| segmenter | Callable[[str], Sequence[str]] \| None | None | Custom word segmenter. If None, pythainlp is used when it is installed and importable; otherwise the built-in longest-match segmenter is used. |

Returns

str — transcribed words joined by spaces. Empty string if input is empty or whitespace-only.

Examples

from thaiphon import transcribe_sentence

# Use a custom segmenter (or pythainlp) for reliable sentence splitting.
def my_segmenter(text: str) -> list[str]:
    return text.split()   # split on spaces (pre-segmented input)

transcribe_sentence("ฉัน ชอบ กิน ข้าว", scheme="ipa", segmenter=my_segmenter)
# '/tɕʰan˩˩˦/ /tɕʰɔːp̚˥˩/ /kin˧/ /kʰaːw˥˩/'

transcribe_sentence("ฉัน ชอบ กิน ข้าว", scheme="tlc", format="html", segmenter=my_segmenter)
# 'chan<sup>R</sup> chaawp<sup>F</sup> gin<sup>M</sup> khaao<sup>F</sup>'

transcribe_sentence("กา นก ปลา", scheme="tlc", format="html", segmenter=my_segmenter)
# 'gaa<sup>M</sup> nohk<sup>H</sup> bplaa<sup>M</sup>'
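
Any callable matching Callable[[str], Sequence[str]] can serve as the segmenter. As a further sketch, assuming the input marks word breaks with zero-width spaces (U+200B), a segmenter could look like the following. The name zwsp_segmenter is hypothetical and the block runs without thaiphon.

```python
from collections.abc import Sequence

ZWSP = "\u200b"  # zero-width space, sometimes used to mark Thai word breaks


def zwsp_segmenter(text: str) -> Sequence[str]:
    """Split on zero-width spaces and on ordinary whitespace (hypothetical helper)."""
    tokens: list[str] = []
    for chunk in text.split(ZWSP):
        # str.split() with no argument drops empty strings and whitespace runs.
        tokens.extend(chunk.split())
    return tokens


print(zwsp_segmenter("ฉัน\u200bชอบ\u200bกิน ข้าว"))
# ['ฉัน', 'ชอบ', 'กิน', 'ข้าว']
```

It would then be passed as segmenter=zwsp_segmenter, in the same way as my_segmenter above.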

Notes

  • Words in non-final position (mid-compound) have their vowel-length overrides suppressed, matching the colloquial shortening of vowels in that position. The word at the end of the sentence receives the full override.
  • If pythainlp is installed and importable, the default segmenter uses it. Otherwise, the built-in longest-match segmenter is used.
  • Whitespace tokens in the segmentation output are skipped.
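
The documented flow (segment, skip whitespace tokens, transcribe each word, join with spaces) can be sketched in a few lines. This is an illustration of the described behaviour, not the library's implementation: it leaves out the vowel-length handling from the first note, and sentence_pipeline and fake_word_fn are hypothetical names, with fake_word_fn standing in for a real per-word transcriber.

```python
from collections.abc import Callable, Sequence


def sentence_pipeline(
    text: str,
    segmenter: Callable[[str], Sequence[str]],
    word_fn: Callable[[str], str],
) -> str:
    """Segment text, transcribe each non-whitespace token, join with spaces."""
    tokens = [tok for tok in segmenter(text) if tok.strip()]
    return " ".join(word_fn(tok) for tok in tokens)


def fake_word_fn(word: str) -> str:
    """Stand-in transcriber so the sketch runs without thaiphon installed."""
    return f"<{word}>"


print(sentence_pipeline("กา นก ปลา", str.split, fake_word_fn))
# <กา> <นก> <ปลา>
print(repr(sentence_pipeline("   ", str.split, fake_word_fn)))
# ''
```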

list_schemes

def list_schemes() -> tuple[str, ...]:

Return a sorted tuple of registered scheme identifiers.

Returns

tuple[str, ...] — sorted tuple of registered scheme IDs, e.g. ('ipa', 'morev', 'paiboon', 'paiboon_plus', 'rtl', 'tlc').

Example

from thaiphon import list_schemes

list_schemes()
# ('ipa', 'morev', 'paiboon', 'paiboon_plus', 'rtl', 'tlc')

Notes

list_schemes() triggers the import of the built-in renderers module, which registers all six built-in schemes. Any additional schemes you have registered with RENDERERS.register also appear.
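
A common use of the scheme list is to transcribe one input in every registered scheme. The sketch below takes the transcription function as a parameter so that it runs on its own. transcribe_all and fake_transcribe are hypothetical names, and fake_transcribe only imitates the keyword interface documented above.

```python
from collections.abc import Callable, Iterable


def transcribe_all(
    text: str,
    schemes: Iterable[str],
    transcribe_fn: Callable[..., str],
) -> dict[str, str]:
    """Map each scheme name to the transcription of text in that scheme."""
    return {scheme: transcribe_fn(text, scheme=scheme) for scheme in schemes}


def fake_transcribe(text: str, scheme: str = "tlc") -> str:
    """Stand-in so the sketch runs without thaiphon installed."""
    return f"{scheme}:{text}"


print(transcribe_all("น้ำ", ("ipa", "tlc"), fake_transcribe))
# {'ipa': 'ipa:น้ำ', 'tlc': 'tlc:น้ำ'}
```

With the library installed, the call would presumably be transcribe_all("น้ำ", list_schemes(), transcribe). Each call re-analyzes the input, so this suits small inputs rather than bulk work.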


ReadingProfile

ReadingProfile = Literal["everyday", "careful_educated", "learned_full", "etalon_compat"]

The four valid profile strings:

| Value | Register |
|---|---|
| "everyday" | Colloquial urban speech (default) |
| "careful_educated" | Formal broadcast register |
| "learned_full" | Full Indic/Sanskrit citation forms |
| "etalon_compat" | Dictionary-citation, collapses foreign codas |

See Reading profiles for details and examples.
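
Because ReadingProfile is a Literal alias, the valid strings can be recovered at runtime with typing.get_args, which is handy for checking user input before it reaches transcribe. The sketch below redefines the alias locally so that it runs without thaiphon; whether and from where the library exports the alias is not stated here, and is_valid_profile is a hypothetical helper.

```python
from typing import Literal, get_args

# Local copy of the alias as documented above (an assumption for this sketch).
ReadingProfile = Literal["everyday", "careful_educated", "learned_full", "etalon_compat"]

# get_args returns the literal values in declaration order.
VALID_PROFILES: tuple[str, ...] = get_args(ReadingProfile)


def is_valid_profile(value: str) -> bool:
    """Return True if value is one of the four documented profile strings."""
    return value in VALID_PROFILES


print(VALID_PROFILES)
# ('everyday', 'careful_educated', 'learned_full', 'etalon_compat')
print(is_valid_profile("everyday"), is_valid_profile("formal"))
# True False
```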