Write your own scheme

thaiphon's output is controlled by a SchemeMapping — a plain Python data structure that maps each phoneme in the engine's internal representation to a string in your target notation. You do not need to touch any phonological code. If you know what sounds Thai has and how you want to spell them, you can write a complete new scheme.


What a scheme needs to specify

A scheme maps these phoneme categories to surface strings:

  1. Onset consonants: one IPA symbol → your string (e.g. "kʰ" → "kh")
  2. Vowels: a (quality, length) pair → your string (e.g. ("a", LONG) → "aa")
  3. Codas: one IPA coda symbol → your string (e.g. "ŋ" → "ng")
  4. Tone: a function that takes a base syllable string and a Syllable object and returns the tone-decorated string
  5. Optional extras: syllable separator, cluster joiner, context-dependent overrides

All of this lives in a SchemeMapping dataclass. Instantiate one, pass it to a MappingRenderer, and register it.
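
To make the division of labour concrete, here is a minimal, self-contained sketch of how a mapping-driven renderer could assemble one syllable from these pieces. It does not import thaiphon: ToySyllable and render_syllable are hypothetical stand-ins, vowel length is a plain string rather than an enum, and the real MappingRenderer may differ in its details.

```python
from dataclasses import dataclass
from typing import Callable, Mapping, Optional, Tuple


@dataclass(frozen=True)
class ToySyllable:
    """Hypothetical stand-in for the engine's syllable object."""
    onset: str            # IPA onset, "" when absent
    vowel: str            # IPA vowel quality
    length: str           # "short" or "long" (stand-in for an enum)
    coda: Optional[str]   # IPA coda, or None for an open syllable
    tone: str             # e.g. "mid", "rising"


def render_syllable(
    syl: ToySyllable,
    onset_map: Mapping[str, str],
    vowel_map: Mapping[Tuple[str, str], str],
    coda_map: Mapping[str, str],
    tone_format: Callable[[str, ToySyllable], str],
    unknown_fallback: str = "?",
) -> str:
    """Look up each phoneme, concatenate, then let tone_format decorate."""
    onset = onset_map.get(syl.onset, unknown_fallback) if syl.onset else ""
    vowel = vowel_map.get((syl.vowel, syl.length), unknown_fallback)
    coda = "" if syl.coda is None else coda_map.get(syl.coda, unknown_fallback)
    return tone_format(onset + vowel + coda, syl)


# A three-entry toy scheme, just enough for one syllable.
onsets = {"kʰ": "kh"}
vowels = {("a", "long"): "aa"}
codas = {"ŋ": "ng"}

syl = ToySyllable(onset="kʰ", vowel="a", length="long", coda="ŋ", tone="mid")
print(render_syllable(syl, onsets, vowels, codas, lambda base, s: base))
```

The point of the sketch is only the shape of the data flow: three table lookups produce the base string, and the tone function has the last word on the result.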


A complete working example: RTGS-inspired scheme

The Royal Thai General System of Transcription (RTGS) is the official romanization system published by the Royal Institute of Thailand. RTGS marks neither tone nor vowel length, which keeps the mapping small. The following example implements a close approximation:

from thaiphon.model.enums import VowelLength, Tone
from thaiphon.model.syllable import Syllable
from thaiphon.renderers.mapping import SchemeMapping, MappingRenderer
from thaiphon.registry import RENDERERS
from thaiphon import transcribe

# Step 1: Map onset IPA symbols to RTGS letters.
# The IPA symbols are the engine's internal representation — not the Thai letters.
ONSET_MAP = {
    "k":   "k",
    "kʰ":  "kh",
    "tɕ":  "ch",   # RTGS uses 'ch' for the unaspirated palatal affricate
    "tɕʰ": "ch",   # RTGS does not distinguish aspirated vs unaspirated here
    "d":   "d",
    "t":   "t",
    "tʰ":  "th",
    "b":   "b",
    "p":   "p",
    "pʰ":  "ph",
    "f":   "f",
    "s":   "s",
    "h":   "h",
    "ʔ":   "",     # glottal onset — RTGS does not write it
    "m":   "m",
    "n":   "n",
    "ŋ":   "ng",
    "j":   "y",
    "r":   "r",
    "l":   "l",
    "w":   "w",
}

# Step 2: Map (vowel quality, length) pairs to surface strings.
# VowelLength.SHORT and VowelLength.LONG come from the enum.
VOWEL_MAP = {
    ("a",  VowelLength.SHORT): "a",
    ("a",  VowelLength.LONG):  "a",    # RTGS does not mark vowel length
    ("i",  VowelLength.SHORT): "i",
    ("i",  VowelLength.LONG):  "i",
    ("u",  VowelLength.SHORT): "u",
    ("u",  VowelLength.LONG):  "u",
    ("e",  VowelLength.SHORT): "e",
    ("e",  VowelLength.LONG):  "e",
    ("ɛ",  VowelLength.SHORT): "ae",
    ("ɛ",  VowelLength.LONG):  "ae",
    ("o",  VowelLength.SHORT): "o",
    ("o",  VowelLength.LONG):  "o",
    ("ɔ",  VowelLength.SHORT): "o",
    ("ɔ",  VowelLength.LONG):  "o",
    ("ɯ",  VowelLength.SHORT): "ue",
    ("ɯ",  VowelLength.LONG):  "ue",
    ("ɤ",  VowelLength.SHORT): "oe",
    ("ɤ",  VowelLength.LONG):  "oe",
    ("iə", VowelLength.SHORT): "ia",
    ("iə", VowelLength.LONG):  "ia",
    ("ɯə", VowelLength.SHORT): "uea",
    ("ɯə", VowelLength.LONG):  "uea",
    ("uə", VowelLength.SHORT): "ua",
    ("uə", VowelLength.LONG):  "ua",
}

# Step 3: Map coda IPA symbols to surface strings.
CODA_MAP = {
    "m":  "m",
    "n":  "n",
    "ŋ":  "ng",
    "p̚":  "p",
    "t̚":  "t",
    "k̚":  "k",
    "w":  "o",   # offglide /w/ → 'o' in RTGS convention
    "j":  "i",   # offglide /j/ → 'i'
    "f":  "f",
}

# Step 4: Tone formatter.
# RTGS does not write tones, so we simply return the base string unchanged.
def tone_format(base: str, syl: Syllable) -> str:
    return base

# Step 5: Assemble the SchemeMapping.
RTGS_MAPPING = SchemeMapping(
    scheme_id="rtgs",
    onset_map=ONSET_MAP,
    vowel_map=VOWEL_MAP,
    coda_map=CODA_MAP,
    tone_format=tone_format,
    syllable_separator="",   # RTGS writes multi-syllable words without a separator
    cluster_joiner="",
    empty_onset="",
    unknown_fallback="?",
)

# Step 6: Register the scheme.
# This only needs to run once — typically at import time.
if "rtgs" not in RENDERERS:
    RENDERERS.register("rtgs", lambda: MappingRenderer(RTGS_MAPPING))

# Step 7: Use it.
transcribe("สวัสดี", scheme="rtgs")
# 'sawatdi'

transcribe("กรุงเทพ", scheme="rtgs")
# 'krungthep'
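
Because a phoneme with no entry in the map is rendered as the unknown_fallback string, it can be worth checking a vowel table for gaps before registering a scheme. The helper below is a sketch that runs without thaiphon: missing_vowel_entries is a hypothetical name, and the Length enum merely stands in for VowelLength.

```python
from enum import Enum
from typing import List, Mapping, Tuple


class Length(Enum):
    """Stand-in for thaiphon's VowelLength, so the sketch runs on its own."""
    SHORT = "short"
    LONG = "long"


def missing_vowel_entries(
    vowel_map: Mapping[Tuple[str, Length], str],
) -> List[Tuple[str, Length]]:
    """Return (quality, length) keys absent for any quality already in the map."""
    qualities = sorted({quality for quality, _ in vowel_map})
    return [
        (quality, length)
        for quality in qualities
        for length in Length
        if (quality, length) not in vowel_map
    ]


# A deliberately incomplete table: long /e/ is missing.
partial = {
    ("a", Length.SHORT): "a",
    ("a", Length.LONG): "a",
    ("e", Length.SHORT): "e",
}
print(missing_vowel_entries(partial))
```

A check like this only catches a quality that has one length but not the other; a quality that is absent altogether still needs a comparison against the engine's own vowel inventory.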

SchemeMapping field reference

from thaiphon.renderers.mapping import SchemeMapping
| Field | Type | Required | Description |
| --- | --- | --- | --- |
| scheme_id | str | yes | The string key used in transcribe(..., scheme="your_id") |
| onset_map | Mapping[str, str] | yes | IPA onset symbol → surface string |
| vowel_map | Mapping[tuple[str, VowelLength], str] | yes | (IPA quality, VowelLength) → surface string |
| coda_map | Mapping[str, str] | yes | IPA coda symbol → surface string |
| tone_format | Callable[[str, Syllable], str] | yes | Function that adds tone decoration to the base syllable string. Used for format="text" and as fallback for format="html". |
| tone_format_html | Callable[[str, Syllable], str] \| None | no | Alternate tone formatter used only when format="html". When None, tone_format is used for both formats. |
| coda_context_map | Mapping[tuple[str, VowelLength, str], str] \| None | no | Context-dependent coda overrides keyed by (vowel, length, coda-IPA) |
| vowel_context_map | Mapping[tuple[str, VowelLength, str], str] \| None | no | Context-dependent vowel overrides keyed by (vowel, length, coda-IPA) |
| word_coda_override | Callable[[str, Syllable, str, str], str \| None] \| None | no | Per-word coda override for loanword/profile-sensitive codas |
| cluster_joiner | str | no | String inserted between the two phonemes of an onset cluster. Default: "" |
| syllable_separator | str | no | Inserted between syllables. Default: "-" |
| empty_onset | str | no | Rendered when a syllable has no onset consonant. Default: "" |
| unknown_fallback | str | no | Rendered when a phoneme has no entry in the map. Default: "?" |

The tone_format function

The tone_format callable receives two arguments:

  • base: str — the syllable rendered so far (onset + vowel + coda, concatenated).
  • syl: Syllable — the Syllable object, giving access to syl.tone (a Tone enum: MID, LOW, FALLING, HIGH, RISING).

Return the final syllable string, with whatever tone decoration your scheme uses.

No tone (RTGS-style):

def tone_format(base: str, syl: Syllable) -> str:
    return base

Bracketed tags (TLC-style):

from thaiphon.model.enums import Tone

_TAG = {Tone.MID: "{M}", Tone.LOW: "{L}", Tone.HIGH: "{H}", Tone.FALLING: "{F}", Tone.RISING: "{R}"}

def tone_format(base: str, syl: Syllable) -> str:
    return base + _TAG[syl.tone]

Trailing tone digit (pedagogical; substitute Unicode superscripts such as "¹" if you want raised digits):

_DIGIT = {Tone.MID: "3", Tone.LOW: "2", Tone.HIGH: "4", Tone.FALLING: "5", Tone.RISING: "1"}

def tone_format(base: str, syl: Syllable) -> str:
    return base + _DIGIT[syl.tone]
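
Tone diacritics are another common convention in learner-oriented romanizations. The sketch below places a combining accent on the first vowel letter of the base string. It is self-contained: the Tone enum and the SimpleNamespace syllable are stand-ins for thaiphon's types, the vowel-letter set is an assumption about the target spelling, and the accent assignment is just one widely used choice.

```python
import unicodedata
from enum import Enum
from types import SimpleNamespace


class Tone(Enum):
    """Stand-in for thaiphon.model.enums.Tone."""
    MID = "mid"
    LOW = "low"
    FALLING = "falling"
    HIGH = "high"
    RISING = "rising"


# Combining marks: grave, circumflex, acute, caron. Mid tone stays bare.
_MARK = {
    Tone.MID: "",
    Tone.LOW: "\u0300",
    Tone.FALLING: "\u0302",
    Tone.HIGH: "\u0301",
    Tone.RISING: "\u030c",
}
_VOWEL_LETTERS = set("aeiou")  # assumption: the scheme spells vowels with these


def tone_format(base: str, syl) -> str:
    """Attach the tone mark to the first vowel letter, if there is one."""
    mark = _MARK[syl.tone]
    for i, ch in enumerate(base):
        if ch in _VOWEL_LETTERS:
            marked = base[: i + 1] + mark + base[i + 1 :]
            # NFC folds letter + combining mark into one code point where possible.
            return unicodedata.normalize("NFC", marked)
    return base  # no vowel letter found: leave the syllable undecorated


print(tone_format("maa", SimpleNamespace(tone=Tone.RISING)))
```

A scheme that spells vowels with letters outside a, e, i, o, u would need a larger letter set, and NFC may leave the mark as a separate combining character where Unicode has no precomposed form.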


Per-format tone markup with tone_format_html

If your scheme needs a different tone representation in HTML output — for example, wrapping the tone tag in a <sup> element — you can declare an optional tone_format_html callable on the mapping:

from thaiphon.model.enums import Tone
from thaiphon.model.syllable import Syllable
from thaiphon.renderers.mapping import SchemeMapping

_TAG = {Tone.MID: "M", Tone.LOW: "L", Tone.HIGH: "H", Tone.FALLING: "F", Tone.RISING: "R"}

def tone_format_text(base: str, syl: Syllable) -> str:
    return base + "{" + _TAG[syl.tone] + "}"

def tone_format_html(base: str, syl: Syllable) -> str:
    return base + "<sup>" + _TAG[syl.tone] + "</sup>"

MY_MAPPING = SchemeMapping(
    scheme_id="my_scheme",
    # ... onset_map, vowel_map, coda_map and any other fields elided ...
    tone_format=tone_format_text,
    tone_format_html=tone_format_html,   # used only when format="html"
)

When transcribe(..., format="html") is called, the renderer uses tone_format_html. For format="text" (the default), it uses tone_format. If tone_format_html is None (the default when omitted), tone_format is used for both formats — so schemes that do not need the distinction need not declare it at all.
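
The fallback rule just described can be written as a few lines of plain Python. This is an illustrative sketch rather than thaiphon's own code: pick_tone_formatter, as_text, and as_html are hypothetical names, and the syllable argument is left untyped so the example runs on its own.

```python
from typing import Callable, Optional

ToneFormatter = Callable[[str, object], str]


def pick_tone_formatter(
    fmt: str,
    tone_format: ToneFormatter,
    tone_format_html: Optional[ToneFormatter] = None,
) -> ToneFormatter:
    """Use the HTML formatter only for format="html", and only if one was given."""
    if fmt == "html" and tone_format_html is not None:
        return tone_format_html
    return tone_format


def as_text(base: str, syl: object) -> str:
    return base + "{M}"


def as_html(base: str, syl: object) -> str:
    return base + "<sup>M</sup>"


print(pick_tone_formatter("text", as_text, as_html)("maa", None))
print(pick_tone_formatter("html", as_text, as_html)("maa", None))
print(pick_tone_formatter("html", as_text)("maa", None))  # no HTML variant given
```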


Context-dependent spellings

Some schemes need the coda or vowel representation to vary based on the surrounding phonemes. The coda_context_map and vowel_context_map fields support this.

Example: a scheme where the /j/ coda spells as "y" after /ɔː/ but "i" elsewhere:

CODA_CONTEXT = {
    ("ɔ", VowelLength.LONG, "j"): "y",
}

mapping = SchemeMapping(
    # ... scheme_id, onset_map, vowel_map, tone_format elided ...
    coda_map={"j": "i"},   # plus the rest of your coda entries
    coda_context_map=CODA_CONTEXT,
)

When thaiphon renders a syllable with vowel /ɔː/ and coda /j/, it checks coda_context_map first and finds "y". Any other vowel before /j/ falls through to coda_map and gets "i".
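
The lookup order can be sketched independently of the library. In the example below, resolve_coda is a hypothetical helper and Length stands in for VowelLength; it shows only the precedence described above (context map first, then the plain coda map, then the fallback string), not thaiphon's actual implementation.

```python
from enum import Enum
from typing import Mapping, Optional, Tuple


class Length(Enum):
    """Stand-in for thaiphon's VowelLength."""
    SHORT = "short"
    LONG = "long"


def resolve_coda(
    vowel: str,
    length: Length,
    coda: str,
    coda_map: Mapping[str, str],
    coda_context_map: Optional[Mapping[Tuple[str, Length, str], str]] = None,
    unknown_fallback: str = "?",
) -> str:
    """Prefer a context-specific spelling, else fall back to the plain coda map."""
    if coda_context_map is not None:
        override = coda_context_map.get((vowel, length, coda))
        if override is not None:
            return override
    return coda_map.get(coda, unknown_fallback)


coda_map = {"j": "i", "n": "n"}
coda_context = {("ɔ", Length.LONG, "j"): "y"}

print(resolve_coda("ɔ", Length.LONG, "j", coda_map, coda_context))   # context hit
print(resolve_coda("a", Length.LONG, "j", coda_map, coda_context))   # plain map
print(resolve_coda("a", Length.SHORT, "x", coda_map, coda_context))  # fallback
```

Keeping the override table small and leaving the general case in coda_map means the common spelling is stated once and only the exceptions need a context key.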


Sharing your scheme

If you have written a scheme that would be useful to other thaiphon users, consider contributing it:

  1. See "Add a scheme" for the conceptual walk-through.
  2. See "Pull requests" for how to propose the addition on GitHub.