transcribe, transcribe_word, transcribe_sentence¶
The three transcription functions convert Thai text to a romanized or phonetic string.
transcribe¶
def transcribe(
    text: str,
    scheme: str = "tlc",
    *,
    format: Literal["text", "html"] = "text",
    profile: ReadingProfile = "everyday",
) -> str:
Transcribe a Thai word or short phrase into the target scheme.
Parameters¶
| Parameter | Type | Default | Description |
|---|---|---|---|
| text | str | (required) | Thai text to transcribe. NFC normalisation is applied automatically. |
| scheme | str | "tlc" | The romanization scheme to use. Must be a registered scheme name. |
| format | "text" \| "html" | "text" | Output format. "text" returns a plain string. "html" activates per-scheme HTML rendering (e.g. superscript tone tags for TLC, superscript aspiration markup for Morev). Schemes without HTML-specific output return the same string as "text". |
| profile | ReadingProfile | "everyday" | Reading profile. Controls register-sensitive pronunciation decisions. |
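The NFC normalisation mentioned for text can be illustrated with the standard library alone. The sketch below does not call thaiphon. It only shows, using unicodedata, that normalising to NFC yields the same string whether the input arrived in NFC or NFD form, which is the property the parameter description relies on. The helper name to_nfc is illustrative and not part of the library.

```python
import unicodedata


def to_nfc(text: str) -> str:
    """Return the NFC-normalised form of text (illustrative helper)."""
    return unicodedata.normalize("NFC", text)


word = "น้ำ"
as_nfd = unicodedata.normalize("NFD", word)
as_nfc = unicodedata.normalize("NFC", word)

# Whichever form the caller supplies, the normalised result is the same.
print(to_nfc(as_nfd) == to_nfc(as_nfc))  # True
```

Callers therefore should not need to pre-normalise input themselves before passing it to transcribe.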
Returns¶
str — the transcribed text in the requested scheme and format.
Raises¶
- UnsupportedSchemeError: if scheme is not registered.
- ValueError: if profile is not one of the four valid profile strings.
Examples¶
from thaiphon import transcribe
# Default scheme is TLC; use format="html" for superscript tone tags.
transcribe("น้ำ", format="html")
# 'naam<sup>H</sup>'
# IPA scheme — format has no effect, output is always the same.
transcribe("น้ำ", scheme="ipa")
# '/naːm˦˥/'
# Morev scheme — format="html" emits superscript aspiration markup.
transcribe("ขอ", scheme="morev", format="html")
# 'к<sup>х</sup>о̄´'
# Morev without html: aspiration, where present, is written as a plain digraph кх / тх / пх (น้ำ has no aspirated consonant, so none appears here).
transcribe("น้ำ", scheme="morev")
# 'на̄мˇ'
# TLC text mode — bracketed tags instead of superscripts.
transcribe("น้ำ", scheme="tlc")
# 'naam{H}'
# Reading profile.
transcribe("ลิฟต์", scheme="ipa", profile="everyday")
# '/lif˦˥/'
transcribe("ลิฟต์", scheme="ipa", profile="etalon_compat")
# '/lip̚˦˥/'
# Empty input returns empty string.
transcribe("", scheme="tlc")
# ''
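The two TLC outputs shown above for น้ำ differ only in how the tone letter is marked up. As a rough illustration of that relationship (not thaiphon's own rendering code), the sketch below rewrites a text-mode string into the HTML form with a regular expression. The helper name tlc_text_to_html is hypothetical, and the tone-letter set H, M, L, R, F is an assumption: only H, M, R and F appear in the examples on this page.

```python
import re

# Hypothetical helper, not part of thaiphon: it only mirrors the pattern
# visible in the examples ('naam{H}' versus 'naam<sup>H</sup>').
_TONE_TAG = re.compile(r"\{([HMLRF])\}")


def tlc_text_to_html(text_form: str) -> str:
    """Replace each brace-delimited tone tag with a superscript tag."""
    return _TONE_TAG.sub(r"<sup>\1</sup>", text_form)


print(tlc_text_to_html("naam{H}"))  # naam<sup>H</sup>
```

In practice, requesting format="html" directly is the safer route, since the library controls the markup for each scheme.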
Notes¶
- transcribe calls analyze internally. For multiple transcriptions of the same word in different schemes, it is more efficient to call analyze once and render with each scheme's renderer.render_word.
- The default scheme is "tlc". To check which schemes are available, call list_schemes().
- NFC normalisation ensures NFD and NFC input produce identical output.
transcribe_word¶
def transcribe_word(
    text: str,
    scheme: str = "tlc",
    *,
    format: Literal["text", "html"] = "text",
    profile: ReadingProfile = "everyday",
) -> str:
Identical to transcribe. Provided as an explicit alternative when the caller wants to signal that the input is a single known word (rather than a possibly multi-word phrase).
Example¶
transcribe_sentence¶
def transcribe_sentence(
    text: str,
    scheme: str = "tlc",
    *,
    format: Literal["text", "html"] = "text",
    profile: ReadingProfile = "everyday",
    segmenter: Callable[[str], Sequence[str]] | None = None,
) -> str:
Segment text into words, transcribe each word, and join the results with spaces.
Parameters¶
| Parameter | Type | Default | Description |
|---|---|---|---|
| text | str | (required) | Full sentence or multi-word string. |
| scheme | str | "tlc" | Romanization scheme. |
| format | "text" \| "html" | "text" | Output format. |
| profile | ReadingProfile | "everyday" | Reading profile. |
| segmenter | Callable[[str], Sequence[str]] \| None | None | Custom word segmenter. If None, the default segmenter is used: pythainlp when it is installed and importable, otherwise the built-in longest-match segmenter. |
Returns¶
str — transcribed words joined by spaces. Empty string if input is empty or whitespace-only.
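The description and return value above amount to a three-step pipeline: segment, drop whitespace tokens, then transcribe each word and join with spaces. A minimal sketch of that flow follows. It is not the library's implementation, and it leaves out the vowel-length handling described in the notes for this function. The names sentence_pipeline_sketch, keep_whitespace_segmenter and fake_transcribe are hypothetical stand-ins so that the block runs without thaiphon installed.

```python
import re
from typing import Callable, Sequence


def sentence_pipeline_sketch(
    text: str,
    segmenter: Callable[[str], Sequence[str]],
    transcribe_word_fn: Callable[[str], str],
) -> str:
    """Segment, skip whitespace-only tokens, transcribe, join with spaces."""
    tokens = segmenter(text)
    words = [tok for tok in tokens if tok.strip()]
    return " ".join(transcribe_word_fn(word) for word in words)


def keep_whitespace_segmenter(text: str) -> list[str]:
    """Toy segmenter that returns whitespace runs as separate tokens."""
    return re.split(r"(\s+)", text)


def fake_transcribe(word: str) -> str:
    """Stand-in for a real per-word transcriber; it just wraps the word."""
    return f"[{word}]"


print(sentence_pipeline_sketch("กา นก ปลา", keep_whitespace_segmenter, fake_transcribe))
# [กา] [นก] [ปลา]
print(repr(sentence_pipeline_sketch("   ", keep_whitespace_segmenter, fake_transcribe)))
# ''
```

The second call shows why whitespace-only input yields an empty string: every token is filtered out before joining.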
Examples¶
from thaiphon import transcribe_sentence
# Use a custom segmenter (or pythainlp) for reliable sentence splitting.
def my_segmenter(text: str) -> list[str]:
    return text.split()  # split on whitespace (pre-segmented input)
transcribe_sentence("ฉัน ชอบ กิน ข้าว", scheme="ipa", segmenter=my_segmenter)
# '/tɕʰan˩˩˦/ /tɕʰɔːp̚˥˩/ /kin˧/ /kʰaːw˥˩/'
transcribe_sentence("ฉัน ชอบ กิน ข้าว", scheme="tlc", format="html", segmenter=my_segmenter)
# 'chan<sup>R</sup> chaawp<sup>F</sup> gin<sup>M</sup> khaao<sup>F</sup>'
transcribe_sentence("กา นก ปลา", scheme="tlc", format="html", segmenter=my_segmenter)
# 'gaa<sup>M</sup> nohk<sup>H</sup> bplaa<sup>M</sup>'
Notes¶
- Words appearing mid-compound have their vowel-length overrides suppressed, matching the colloquial shortening of vowels in non-final position. Words at the end of the sentence receive the full override.
- If pythainlp is installed and importable, the default segmenter uses it. Otherwise, the built-in longest-match segmenter is used.
- Whitespace tokens in the segmentation output are skipped.
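For readers unfamiliar with longest-match segmentation, here is a toy version of the idea. It is a sketch only: the built-in segmenter's dictionary and tie-breaking rules are not described on this page, so the four-word lexicon and the single-character fallback below are assumptions, and longest_match_segment is a hypothetical name.

```python
from functools import partial


def longest_match_segment(text: str, lexicon: frozenset[str]) -> list[str]:
    """Greedy left-to-right segmentation, preferring the longest lexicon hit."""
    max_len = max((len(word) for word in lexicon), default=1)
    tokens: list[str] = []
    i = 0
    while i < len(text):
        # Try the longest possible candidate first, then shrink.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
        else:
            # Assumed fallback: emit one character when nothing matches.
            tokens.append(text[i])
            i += 1
    return tokens


# Toy lexicon for illustration; a real one would be far larger.
TOY_LEXICON = frozenset({"ฉัน", "ชอบ", "กิน", "ข้าว"})

# partial() yields a one-argument callable of the shape `segmenter` expects.
toy_segmenter = partial(longest_match_segment, lexicon=TOY_LEXICON)

print(toy_segmenter("ฉันชอบกินข้าว"))
# ['ฉัน', 'ชอบ', 'กิน', 'ข้าว']
```

A callable shaped like toy_segmenter could then be passed as segmenter=toy_segmenter to transcribe_sentence.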
list_schemes¶
Return a sorted tuple of registered scheme identifiers.
Returns¶
tuple[str, ...] — sorted tuple of registered scheme IDs, e.g. ('ipa', 'morev', 'paiboon', 'paiboon_plus', 'rtl', 'tlc').
Example¶
from thaiphon import list_schemes
list_schemes()
# ('ipa', 'morev', 'paiboon', 'paiboon_plus', 'rtl', 'tlc')
Notes¶
list_schemes() triggers the import of the built-in renderers module, which registers all six built-in schemes. Any additional schemes you have registered with RENDERERS.register also appear.
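Because list_schemes() returns a plain tuple of strings, a caller can validate a user-supplied scheme name before transcribing. The sketch below takes the tuple as an argument so that it runs without thaiphon. resolve_scheme is a hypothetical helper, and the hard-coded tuple simply copies the example output above.

```python
def resolve_scheme(
    requested: str,
    available: tuple[str, ...],
    default: str = "tlc",
) -> str:
    """Return `requested` if it is registered, otherwise fall back to `default`."""
    if requested in available:
        return requested
    if default not in available:
        raise ValueError(f"default scheme {default!r} is not registered")
    return default


# In real code this would be list_schemes(); copied here from the example output.
schemes = ("ipa", "morev", "paiboon", "paiboon_plus", "rtl", "tlc")

print(resolve_scheme("ipa", schemes))      # ipa
print(resolve_scheme("unknown", schemes))  # tlc
```

Falling back silently is a design choice. Code that should fail loudly can skip this helper and let transcribe raise UnsupportedSchemeError instead.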
ReadingProfile¶
The four valid profile strings:
| Value | Register |
|---|---|
| "everyday" | Colloquial urban speech (default) |
| "careful_educated" | Formal broadcast register |
| "learned_full" | Full Indic/Sanskrit citation forms |
| "etalon_compat" | Dictionary-citation, collapses foreign codas |
See Reading profiles for details and examples.
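The signatures on this page annotate profile as ReadingProfile, and an invalid value raises ValueError. One plausible way to model that is a Literal alias plus a runtime check. This is an assumption for illustration: the page does not show how ReadingProfile is actually defined, and ProfileName and check_profile are hypothetical names.

```python
from typing import Literal, get_args

# Assumed shape of the alias; the real definition is not shown in this page.
ProfileName = Literal["everyday", "careful_educated", "learned_full", "etalon_compat"]
VALID_PROFILES: tuple[str, ...] = get_args(ProfileName)


def check_profile(profile: str) -> str:
    """Return profile unchanged, or raise ValueError if it is not a known value."""
    if profile not in VALID_PROFILES:
        raise ValueError(
            f"unknown profile {profile!r}; expected one of {VALID_PROFILES}"
        )
    return profile


print(check_profile("etalon_compat"))  # etalon_compat
```

A check like this is useful at application boundaries, for example when the profile comes from a configuration file or a query parameter.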