Add a romanization scheme

This page explains what a scheme is, how to write one, and how to contribute it to thaiphon. It is written for people who know Thai phonology but may not have a software development background.

What a scheme is

A romanization scheme is a set of rules that spells out each sound in a target writing system. For Thai, a scheme answers questions like:

How do you write the aspirated /k/ sound — kh? k'? К?
How do you write a high tone — a number? A diacritic? A bracketed tag?
How do you write a long vowel — double the letter? Add a colon?

In thaiphon, a scheme is a SchemeMapping — a Python data structure with four required parts:

Part	What it does
`onset_map`	Maps each onset phoneme to a string
`vowel_map`	Maps each (vowel quality, length) pair to a string
`coda_map`	Maps each coda phoneme to a string
`tone_format`	A function that adds tone decoration to the syllable string

Everything else — the phonological analysis, the syllabification, the tone derivation — is already done by the engine. Your scheme only controls the final spelling step.
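As a mental model only, the four parts can be pictured as three lookups followed by one decoration step. The sketch below is not thaiphon's actual code: `ToySyllable`, `spell`, and `bracket_tone` are hypothetical names, and the real Syllable object carries more detail than this.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ToySyllable:
    """Hypothetical, simplified syllable used only for this illustration."""
    onset: str
    vowel: tuple[str, str]   # (quality, length), e.g. ("a", "long")
    coda: Optional[str]
    tone: str


def spell(syl: ToySyllable, onset_map: dict, vowel_map: dict, coda_map: dict,
          tone_format: Callable[[str, ToySyllable], str]) -> str:
    """Look up each part, join the pieces, then let tone_format decorate them."""
    base = onset_map[syl.onset] + vowel_map[syl.vowel]
    if syl.coda is not None:
        base += coda_map[syl.coda]
    return tone_format(base, syl)


def bracket_tone(base: str, syl: ToySyllable) -> str:
    """Tone decoration as a bracketed tag after the syllable."""
    return f"{base}[{syl.tone}]"


# ข้าว "rice": onset kʰ, long a, coda w, falling tone
rice = ToySyllable("kʰ", ("a", "long"), "w", "falling")
print(spell(rice, {"kʰ": "kh"}, {("a", "long"): "aa"}, {"w": "o"}, bracket_tone))
# khaao[falling]
```

The point of the split is that changing a spelling never requires touching the analysis: only the three tables and the tone function differ between schemes.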

The phonemes you need to handle

Onset phonemes (IPA symbols)

These are the phonemes the engine will pass to your onset_map:

k   kʰ   tɕ   tɕʰ   d   t   tʰ   b   p   pʰ   f   s   h   ʔ
m   n   ŋ   j   r   l   w

You must provide a mapping for each one. If you use an empty string "" for any phoneme (e.g. for the glottal stop ʔ), that phoneme will produce no output.
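Because every onset needs an entry, a quick completeness check can catch a forgotten key before you register the scheme. This is a plain-Python sketch that does not touch thaiphon; `REQUIRED_ONSETS` and `missing_keys` are illustrative names, and the set is simply copied from the list above.

```python
# Illustrative helper, not part of thaiphon. The required onsets are copied
# from the list in this section.
REQUIRED_ONSETS = {
    "k", "kʰ", "tɕ", "tɕʰ", "d", "t", "tʰ", "b", "p", "pʰ",
    "f", "s", "h", "ʔ", "m", "n", "ŋ", "j", "r", "l", "w",
}


def missing_keys(mapping: dict[str, str], required: set[str]) -> list[str]:
    """Return the required phonemes that have no entry in the mapping, sorted."""
    return sorted(required - set(mapping))


# A deliberately incomplete draft: only three onsets are mapped so far.
draft = {"k": "k", "kʰ": "kh", "ʔ": ""}
print(len(missing_keys(draft, REQUIRED_ONSETS)))     # 18
print("ʔ" in missing_keys(draft, REQUIRED_ONSETS))   # False: "" still counts as an entry
```

The same helper works for the coda list if you pass a different required set.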

Vowel phonemes (IPA quality + length)

Each vowel is a pair: the IPA quality symbol and a length (SHORT or LONG). The qualities are:

a   i   u   e   ɛ   o   ɔ   ɯ   ɤ   iə   ɯə   uə

So your vowel_map needs entries like:

("a", VowelLength.SHORT): "a",
("a", VowelLength.LONG):  "aa",
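If your long vowels follow a regular rule, you can generate the 24 entries instead of typing them. The sketch below assumes a scheme that marks length with a colon; `BASE` and `build_vowel_map` are hypothetical names, the spellings are invented for illustration, and the local `VowelLength` is only a stand-in so the snippet runs on its own. In a real scheme, import `VowelLength` from `thaiphon.model.enums` instead.

```python
from enum import Enum


class VowelLength(Enum):
    """Stand-in for thaiphon.model.enums.VowelLength so the sketch runs alone."""
    SHORT = "short"
    LONG = "long"


# One base spelling per vowel quality (hypothetical spellings).
BASE = {
    "a": "a", "i": "i", "u": "u", "e": "e", "ɛ": "ae", "o": "o",
    "ɔ": "aw", "ɯ": "ue", "ɤ": "oe", "iə": "ia", "ɯə": "uea", "uə": "ua",
}


def build_vowel_map(base: dict[str, str], long_mark: str = ":") -> dict:
    """Expand each quality into a SHORT entry and a LONG entry with long_mark appended."""
    vowel_map = {}
    for quality, spelling in base.items():
        vowel_map[(quality, VowelLength.SHORT)] = spelling
        vowel_map[(quality, VowelLength.LONG)] = spelling + long_mark
    return vowel_map


VOWEL_MAP = build_vowel_map(BASE)
print(len(VOWEL_MAP))                        # 24
print(VOWEL_MAP[("a", VowelLength.LONG)])    # a:
```

Irregular spellings can still be patched in afterwards by assigning individual keys.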

Coda phonemes (IPA symbols)

m   n   ŋ   p̚   t̚   k̚   w   j   f

Note: p̚, t̚, and k̚ each carry a combining "no audible release" mark (U+031A) after the base letter. The mark is hard to type, so copy these symbols from this page or from the source code.
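If copy-pasting feels fragile, the keys can also be built from a Unicode escape. This sketch uses only the standard library; `UNRELEASED` and `STOP_CODAS` are illustrative names, and the spellings on the right are examples.

```python
import unicodedata

UNRELEASED = "\u031a"  # the combining mark in p̚, t̚, k̚

# Build the three stop-coda keys from an escape instead of pasting them.
STOP_CODAS = {base + UNRELEASED: base for base in ("p", "t", "k")}

print(unicodedata.name(UNRELEASED))        # COMBINING LEFT ANGLE ABOVE
print([len(key) for key in STOP_CODAS])    # [2, 2, 2]: base letter plus the mark
print("p" in STOP_CODAS)                   # False: a bare "p" is a different key
```

The last line shows the usual mistake: a key typed without the mark looks almost identical on screen but will never match.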

The f coda appears only when a loanword with preserved /f/ is rendered.

A minimal example

Suppose you want a scheme that marks each syllable's tone with a trailing digit (1=rising, 2=low, 3=mid, 4=high, 5=falling) and doubles letters for long vowels:

from thaiphon.model.enums import VowelLength, Tone
from thaiphon.model.syllable import Syllable
from thaiphon.renderers.mapping import SchemeMapping, MappingRenderer
from thaiphon.registry import RENDERERS
from thaiphon import transcribe

_ONSET = {
    "k": "k", "kʰ": "kh", "tɕ": "j", "tɕʰ": "ch",
    "d": "d", "t": "dt", "tʰ": "th", "b": "b", "p": "bp", "pʰ": "ph",
    "f": "f", "s": "s", "h": "h", "ʔ": "",
    "m": "m", "n": "n", "ŋ": "ng", "j": "y", "r": "r", "l": "l", "w": "w",
}

_VOWEL = {
    ("a",  VowelLength.SHORT): "a",   ("a",  VowelLength.LONG): "aa",
    ("i",  VowelLength.SHORT): "i",   ("i",  VowelLength.LONG): "ii",
    ("u",  VowelLength.SHORT): "u",   ("u",  VowelLength.LONG): "uu",
    ("e",  VowelLength.SHORT): "e",   ("e",  VowelLength.LONG): "ee",
    ("ɛ",  VowelLength.SHORT): "ae",  ("ɛ",  VowelLength.LONG): "aae",
    ("o",  VowelLength.SHORT): "o",   ("o",  VowelLength.LONG): "oo",
    ("ɔ",  VowelLength.SHORT): "aw",  ("ɔ",  VowelLength.LONG): "aaw",
    ("ɯ",  VowelLength.SHORT): "eu",  ("ɯ",  VowelLength.LONG): "euu",
    ("ɤ",  VowelLength.SHORT): "er",  ("ɤ",  VowelLength.LONG): "err",
    ("iə", VowelLength.SHORT): "ia",  ("iə", VowelLength.LONG): "ia",
    ("ɯə", VowelLength.SHORT): "uea", ("ɯə", VowelLength.LONG): "uea",
    ("uə", VowelLength.SHORT): "ua",  ("uə", VowelLength.LONG): "ua",
}

_CODA = {
    "m": "m", "n": "n", "ŋ": "ng",
    "p̚": "p", "t̚": "t", "k̚": "k",
    "w": "o", "j": "i", "f": "f",
}

_TONE_NUM = {
    Tone.RISING: "1", Tone.LOW: "2", Tone.MID: "3",
    Tone.HIGH: "4", Tone.FALLING: "5",
}

def _tone_format(base: str, syl: Syllable) -> str:
    return base + _TONE_NUM[syl.tone]

MY_MAPPING = SchemeMapping(
    scheme_id="myscheme",
    onset_map=_ONSET,
    vowel_map=_VOWEL,
    coda_map=_CODA,
    tone_format=_tone_format,
    syllable_separator="-",
    cluster_joiner="",
    empty_onset="",
    unknown_fallback="?",
)

if "myscheme" not in RENDERERS:
    RENDERERS.register("myscheme", lambda: MappingRenderer(MY_MAPPING))

transcribe("สวัสดี", scheme="myscheme")
# 'sa2-wat2-dii3'  (one tone digit per syllable: low, low, mid)
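The tone function is where a scheme's look changes most. The example above appends plain digits; the sketch below shows two alternatives, true Unicode superscripts and a bracketed tag. To keep it runnable without thaiphon, `Tone` is a local stand-in enum and `SimpleNamespace` plays the part of a syllable; it assumes only that the syllable exposes a `.tone` attribute, as the example above does.

```python
from enum import Enum
from types import SimpleNamespace


class Tone(Enum):
    """Stand-in for thaiphon.model.enums.Tone so the sketch runs alone."""
    MID = "mid"
    LOW = "low"
    FALLING = "falling"
    HIGH = "high"
    RISING = "rising"


# Same numbering as the minimal example, but with superscript digits.
_SUPERSCRIPT = {
    Tone.RISING: "\u00b9", Tone.LOW: "\u00b2", Tone.MID: "\u00b3",
    Tone.HIGH: "\u2074", Tone.FALLING: "\u2075",
}


def superscript_tone(base: str, syl) -> str:
    """Append the tone number as a superscript digit."""
    return base + _SUPERSCRIPT[syl.tone]


def bracket_tone(base: str, syl) -> str:
    """Append the tone name as a bracketed tag."""
    return f"{base}[{syl.tone.name.lower()}]"


print(superscript_tone("dii", SimpleNamespace(tone=Tone.MID)))   # dii³
print(bracket_tone("wat", SimpleNamespace(tone=Tone.LOW)))       # wat[low]
```

Either function could be passed as `tone_format` in place of `_tone_format`, provided it uses the library's own `Tone` enum rather than the stand-in.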

Context-dependent spellings (advanced)

Some schemes need to spell a coda or vowel differently depending on what surrounds it. Use coda_context_map and vowel_context_map:

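# Example: spell the /j/ coda as "y" after /ɔː/, "i" elsewhere.
from thaiphon.model.enums import VowelLength
from thaiphon.renderers.mapping import SchemeMapping

# Key: (vowel quality, vowel length, coda phoneme) -> spelling in that context.
CODA_CONTEXT = {
    ("ɔ", VowelLength.LONG, "j"): "y",
}

# The remaining fields reuse the names from the minimal example above.
mapping = SchemeMapping(
    scheme_id="myscheme",
    onset_map=_ONSET,
    vowel_map=_VOWEL,
    coda_map=_CODA,                 # default: "j" is spelled "i"
    tone_format=_tone_format,
    coda_context_map=CODA_CONTEXT,  # override for specific contexts
)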
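The comments in that example suggest a lookup order: try the context override first, then fall back to the default coda spelling. The sketch below models that order in plain Python. It is an assumption about the behaviour, not thaiphon's implementation; `spell_coda` is a hypothetical name and plain strings stand in for `VowelLength`.

```python
from typing import Optional


def spell_coda(coda: str, vowel_quality: str, vowel_length: str,
               coda_map: dict, coda_context_map: Optional[dict] = None) -> str:
    """Prefer a (quality, length, coda) override; otherwise use the default spelling."""
    context_key = (vowel_quality, vowel_length, coda)
    if coda_context_map and context_key in coda_context_map:
        return coda_context_map[context_key]
    return coda_map[coda]


coda_map = {"j": "i"}
coda_context_map = {("ɔ", "long", "j"): "y"}

print(spell_coda("j", "ɔ", "long", coda_map, coda_context_map))   # y
print(spell_coda("j", "a", "long", coda_map, coda_context_map))   # i
```

Keeping the default in `coda_map` means the context map only has to list the exceptions.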

Checking your scheme's output

Once registered, use transcribe to check a set of known words:

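from thaiphon import transcribe

# Thai word -> English gloss; the gloss only labels the printed output.
test_words = {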
    "กา": "crow",
    "ขา": "leg",
    "มา": "to come",
    "ลิฟต์": "elevator",
    "สวัสดี": "hello",
    "น้ำ": "water",
}

for thai, gloss in test_words.items():
    print(f"{thai} ({gloss}): {transcribe(thai, scheme='myscheme')}")

Compare your output against known romanizations from Thai textbooks or dictionaries to verify correctness.
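Once you have reference spellings, the comparison itself can be automated. The helper below is a sketch with hypothetical names (`check_scheme`, `fake_output`); it takes any function that turns a Thai word into a string, so it runs here with a stand-in and without thaiphon. The expected spellings are invented for illustration, not taken from a dictionary.

```python
from typing import Callable


def check_scheme(render: Callable[[str], str], expected: dict[str, str]) -> list[str]:
    """Return one message per word whose rendering differs from the expected spelling."""
    problems = []
    for thai, want in expected.items():
        got = render(thai)
        if got != want:
            problems.append(f"{thai}: expected {want!r}, got {got!r}")
    return problems


# Stand-in renderer so the sketch runs alone. In practice, pass something like
# lambda word: transcribe(word, scheme="myscheme") instead.
fake_output = {"กา": "gaa3", "ขา": "khaa1"}
expected = {"กา": "kaa3", "ขา": "khaa1"}

for line in check_scheme(fake_output.__getitem__, expected):
    print(line)   # กา: expected 'kaa3', got 'gaa3'
```

An empty result means every word matched, which makes the helper easy to reuse inside a test.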

Contributing your scheme

See Pull requests for how to propose your scheme to the project.

When contributing, place your scheme file in src/thaiphon/renderers/my_scheme.py, register it at module level (with the if "my_scheme" not in RENDERERS: guard), and import it from src/thaiphon/renderers/__init__.py.

Add a few representative test cases to tests/test_renderers/test_my_scheme.py. See the existing renderer tests (tests/test_renderers/test_ipa.py, etc.) for the expected structure.