Override lexicons¶

thaiphon derives Thai phonology from orthographic rules and a built-in vocabulary. For most words the rule-based output is correct. But some words resist derivation: ceremonial compounds whose morpheme boundaries the segmenter cannot infer, domain-specific terms where your site holds a more authoritative form, or entries from corpora you cannot distribute upstream. Override lexicons are the escape hatch.

The idea in one paragraph¶

You register a Python callable. When thaiphon processes a Thai word, it calls your callable first — before consulting any built-in lexicon and before running the derivation rules. If your callable returns a PhonologicalWord, that result goes straight to the renderer; the rest of the pipeline is skipped. If it returns None, processing continues normally. Multiple callables can be stacked in priority order.

The library provides only the hook. Storage — an in-memory dict, a SQLite file, an HTTP call — is entirely your choice.

Why this exists¶

Pathological compounds¶

Thai Sanskrit-Pali compounds — ceremonial, religious, and legal vocabulary — sometimes have morpheme boundaries that only lexical knowledge can resolve. The full ceremonial name of Bangkok contains eleven morpheme boundaries that the rule-based segmenter cannot correctly infer from orthography alone. Rule-based derivation of such words produces a plausible but incorrect syllabification. A single hand-curated override entry fixes it immediately, without waiting for a library release.

Domain-specific correctness¶

A site running a Thai learner dictionary, a place-name gazetteer, a legal-terms database, or a brand-name lookup may have authoritative phonological forms that disagree with thaiphon's output for specific words. The override hook lets those sites inject their ground truth as a drop-in, with no changes to the engine.

Private corpora¶

Some authoritative phonological resources are licensed such that a consumer can use derived forms internally but cannot redistribute them. Override lexicons let consumers benefit from such data without contributing it to thaiphon's built-in vocabulary.

Where overrides sit in the pipeline¶

Thai text
    │
    ▼
NFC normalisation + Sara-Am expansion
    │
    ▼
Override lexicons  ← your callable is called here
    │ hit: return PhonologicalWord, skip everything below
    │ miss: continue
    ▼
Built-in lexicons (VOLUBILIS exact, Indic learned, calendar, …)
    │
    ▼
Syllabification + rule-based derivation
    │
    ▼
PhonologicalWord
    │
    ▼
Renderer → IPA / TLC / Morev / RTGS / Paiboon / LMT

Your callable receives the word in its post-normalisation Thai form — NFC-normalised, Sara-Am expanded, mark order canonicalised. You do not need to replicate thaiphon's normalisation pass.
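
One practical consequence is that the keys in your own store need to be in the same normalised form, or lookups can miss without any error. The sketch below re-keys a dict once at load time. It rests on two assumptions that this page does not confirm: that "Sara-Am expansion" means rewriting U+0E33 (ำ) as U+0E4D followed by U+0E32, and that no mark reordering is needed for your entries. `normalise_key` and `raw_entries` are hypothetical names, not part of thaiphon.

```python
import unicodedata

SARA_AM = "\u0e33"                  # ำ as a single code point
EXPANDED_SARA_AM = "\u0e4d\u0e32"   # nikhahit + sara aa (assumed expanded form)


def normalise_key(word: str) -> str:
    """Approximate the form an override callable is assumed to receive.

    Applies NFC, then rewrites Sara Am. Mark reordering is NOT attempted.
    """
    return unicodedata.normalize("NFC", word).replace(SARA_AM, EXPANDED_SARA_AM)


# Hypothetical store, keyed by whatever form the words were typed in.
raw_entries = {"ท\u0e33": "entry for ทำ"}

# Re-key once at load time so lookups match the normalised input.
OVERRIDES = {normalise_key(k): v for k, v in raw_entries.items()}
```

If an entry never seems to hit, logging the exact string your callable receives is a quick way to see the form thaiphon really passes in.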

One override feeds every renderer automatically. Register a PhonologicalWord once and IPA, TLC, Morev, RTGS, Paiboon, and LMT all render from the same structure.

Quick start¶

from thaiphon import register_lexicon, transcribe
from thaiphon.model.word import PhonologicalWord
from thaiphon.model.syllable import Syllable
from thaiphon.model.phoneme import Phoneme, Cluster
from thaiphon.model.enums import VowelLength, Tone, ToneMark, EffectiveClass, SyllableType

# Build a PhonologicalWord for กรุงเทพ (Bangkok).
# The pipeline's default segmentation often treats this as two syllables;
# an override locks in the form you want.
bangkok = PhonologicalWord(
    syllables=(
        Syllable(
            # กร is a two-consonant onset /kr/, so it is a Cluster, not a bare Phoneme.
            onset=Cluster(first=Phoneme("k"), second=Phoneme("r", is_sonorant=True)),
            vowel=Phoneme("u"),
            vowel_length=VowelLength.SHORT,
            coda=Phoneme("ŋ"),
            tone=Tone.MID,
            tone_mark=ToneMark.NONE,
            effective_class=EffectiveClass.MID,
            syllable_type=SyllableType.LIVE,
        ),
        Syllable(
            onset=Phoneme("tʰ", is_aspirated=True),
            vowel=Phoneme("eː"),
            vowel_length=VowelLength.LONG,
            coda=Phoneme("p̚"),
            tone=Tone.HIGH,
            tone_mark=ToneMark.NONE,
            effective_class=EffectiveClass.HIGH,
            syllable_type=SyllableType.DEAD,
        ),
    ),
)

OVERRIDES = {"กรุงเทพ": bangkok}

register_lexicon(lambda w: OVERRIDES.get(w), name="my-site")

print(transcribe("กรุงเทพ", scheme="ipa"))
# '/kruŋ˧.tʰeːp̚˥/'

print(transcribe("กรุงเทพ", scheme="tlc"))
# 'groong{M} thaep{H}'

The source field of the returned AnalysisResult is set to 'override:my-site', so you can tell which layer answered:

from thaiphon import analyze

result = analyze("กรุงเทพ")
print(result.source)   # 'override:my-site'

Building a PhonologicalWord¶

The main thing you need to know is how to construct the data structure. Every syllable is a Syllable frozen dataclass; every phoneme is a Phoneme (or Cluster for two-consonant onsets). The fields map directly to the phonological concepts described in the phonological model.

Phoneme¶

from thaiphon.model.phoneme import Phoneme, Cluster

Phoneme(symbol: str, is_aspirated: bool = False, is_sonorant: bool = False)

Use IPA symbols for symbol. A few conventions:

What you want	Symbol
Aspirated onset	`Phoneme("kʰ", is_aspirated=True)`
Sonorant onset	`Phoneme("n", is_sonorant=True)`
Long vowel	`Phoneme("aː")`
Short vowel	`Phoneme("a")`
Unreleased stop coda	`Phoneme("p̚")` or `Phoneme("t̚")` or `Phoneme("k̚")`
Sonorant coda	`Phoneme("m")`, `Phoneme("n")`, `Phoneme("ŋ")`
Glide coda	`Phoneme("j")`, `Phoneme("w")`

For a two-consonant onset cluster such as /pl/ (ปลา):

onset = Cluster(first=Phoneme("p"), second=Phoneme("l", is_sonorant=True))

Syllable¶

from thaiphon.model.syllable import Syllable
from thaiphon.model.enums import VowelLength, Tone, ToneMark, EffectiveClass, SyllableType

Syllable(
    onset,          # Phoneme | Cluster | None
    vowel,          # Phoneme
    vowel_length,   # VowelLength.SHORT | VowelLength.LONG
    coda,           # Phoneme | None  (None for open syllables)
    tone,           # Tone.MID | .LOW | .FALLING | .HIGH | .RISING
    tone_mark,      # ToneMark.NONE | .MAI_EK | .MAI_THO | .MAI_TRI | .MAI_JATTAWA
    effective_class,# EffectiveClass.HIGH | .MID | .LOW
    syllable_type,  # SyllableType.LIVE | SyllableType.DEAD
)

All four enumeration arguments have defaults, so you only need to pass the ones that matter:

# Open syllable; tone_mark, effective_class, and syllable_type are left at their defaults
Syllable(
    onset=Phoneme("k"),
    vowel=Phoneme("aː"),
    vowel_length=VowelLength.LONG,
    coda=None,
    tone=Tone.MID,
)

tone is the actual surface tone — one of the five Thai tones. This is what renderers use. tone_mark records which diacritic was written in the orthography, if any. For override entries where you already know the surface tone, set tone_mark=ToneMark.NONE unless you specifically want to record the orthographic mark.

effective_class is the consonant class used when the tone was derived. For override entries, set it to the class that corresponds to the onset consonant: high-class consonants, and low-class consonants that a leading ห promotes to high-class behaviour, use EffectiveClass.HIGH; other low-class consonants use EffectiveClass.LOW; mid-class consonants use EffectiveClass.MID. Renderers do not inspect this field, but it is carried through for diagnostics. When in doubt, EffectiveClass.MID is a safe placeholder.
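
As a rough aid for picking the class, the standard three-way grouping of Thai consonants can be put in a small lookup. This is a sketch only: `onset_class` is a hypothetical helper, it returns plain strings rather than EffectiveClass members, and it handles only the leading-ห case mentioned above.

```python
# Standard Thai consonant classes; every consonant not listed is low-class.
HIGH_CLASS = set("ขฃฉฐถผฝศษสห")
MID_CLASS = set("กจฎฏดตบปอ")


def onset_class(consonant: str, led_by_ho_hip: bool = False) -> str:
    """Return "HIGH", "MID", or "LOW" for a single Thai onset consonant.

    led_by_ho_hip marks a low-class consonant written with a leading ห,
    which then behaves as high-class.
    """
    if len(consonant) != 1 or not ("\u0e01" <= consonant <= "\u0e2e"):
        raise ValueError(f"not a single Thai consonant: {consonant!r}")
    if led_by_ho_hip or consonant in HIGH_CLASS:
        return "HIGH"
    if consonant in MID_CLASS:
        return "MID"
    return "LOW"
```

For example, `onset_class("ก")` gives `"MID"`, `onset_class("ส")` gives `"HIGH"`, and `onset_class("น", led_by_ho_hip=True)` gives `"HIGH"`. Mapping the string to the matching EffectiveClass member is left to your own code.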

syllable_type is DEAD for syllables ending in an unreleased stop (p̚, t̚, k̚) or a short vowel with no coda; LIVE for everything else.
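
The rule above is mechanical enough to write down as a helper. The sketch returns the member name as a string so that it runs without thaiphon installed; `syllable_type_name` is a hypothetical function, and it encodes only the rule stated on this page.

```python
# Unreleased stop codas, written with U+031A (combining left angle above).
UNRELEASED_STOPS = {"p\u031a", "t\u031a", "k\u031a"}


def syllable_type_name(coda_symbol, vowel_is_short: bool) -> str:
    """Return "DEAD" or "LIVE" for a syllable.

    coda_symbol is the IPA coda symbol, or None for an open syllable.
    """
    if coda_symbol in UNRELEASED_STOPS:
        return "DEAD"          # closed by an unreleased stop
    if coda_symbol is None and vowel_is_short:
        return "DEAD"          # short vowel with no coda
    return "LIVE"              # sonorant or glide coda, or open long vowel
```

Applied to the two syllables of กรุงเทพ, the first (coda ŋ) comes out as LIVE and the second (coda p̚) as DEAD.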

PhonologicalWord¶

from thaiphon.model.word import PhonologicalWord

PhonologicalWord(
    syllables: tuple[Syllable, ...],
    morpheme_boundaries: tuple[int, ...] = (),  # syllable indices where morpheme splits occur
    confidence: float = 1.0,
    source: str = "derivation",  # overridden by the pipeline to 'override:<name>'
    raw: str = "",               # overridden by the pipeline from the input text
)

The pipeline overwrites source and raw when it serves an override result, so you can leave them at their defaults.

morpheme_boundaries records where a compound word splits at the morpheme level. For a compound such as กรุงเทพ (two morphemes: กรุง and เทพ), pass morpheme_boundaries=(1,), meaning the second morpheme starts at syllable index 1, so the boundary falls between syllables 0 and 1. This information is available to renderers that use it (currently for diagnostic purposes).
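
To check a boundary tuple before storing it, it can help to see how it partitions the syllables. The sketch below reads each index as the position where a new morpheme starts, which is an assumption about the field's convention; `group_by_morpheme` is a hypothetical helper that works on any sequence.

```python
def group_by_morpheme(syllables, boundaries):
    """Split a sequence of syllables into per-morpheme tuples.

    Each entry in boundaries is taken to be the index of the syllable
    that opens a new morpheme.
    """
    groups = []
    start = 0
    for index in sorted(boundaries):
        groups.append(tuple(syllables[start:index]))
        start = index
    groups.append(tuple(syllables[start:]))
    return groups


# Plain strings stand in for Syllable objects here.
print(group_by_morpheme(("kruŋ", "tʰeːp"), (1,)))
# [('kruŋ',), ('tʰeːp',)]
```

An empty group in the result means a boundary index was 0, repeated, or past the end of the word.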

Worked examples¶

Example 1: in-memory dict¶

The simplest possible override: a plain Python dict.

from thaiphon import register_lexicon, transcribe, analyze
from thaiphon.model.word import PhonologicalWord
from thaiphon.model.syllable import Syllable
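from thaiphon.model.phoneme import Phoneme, Cluster
from thaiphon.model.enums import VowelLength, Tone, EffectiveClass, SyllableType

def _phoneme(sym, sonorant=False):
    # Flag aspiration from the IPA symbol so callers cannot forget it.
    return Phoneme(sym, is_aspirated=sym.endswith("ʰ"), is_sonorant=sonorant)

def _onset(sym):
    # None or "": no onset. A 2-tuple: onset cluster. A string: single consonant.
    if not sym:
        return None
    if isinstance(sym, tuple):
        first, second = sym
        # The second member of a Thai onset cluster is a liquid or glide (r, l, w).
        return Cluster(first=_phoneme(first), second=_phoneme(second, sonorant=True))
    return _phoneme(sym)

def _syllable(onset_sym, vowel_sym, length, coda_sym, tone, ec, st):
    return Syllable(
        onset=_onset(onset_sym),
        vowel=Phoneme(vowel_sym),
        vowel_length=length,
        coda=Phoneme(coda_sym) if coda_sym else None,
        tone=tone,
        effective_class=ec,
        syllable_type=st,
    )

V = VowelLength
T = Tone
EC = EffectiveClass
ST = SyllableType

SITE_OVERRIDES: dict[str, PhonologicalWord] = {
    # กรุงเทพ: two-syllable compound; กร is the onset cluster /kr/
    "กรุงเทพ": PhonologicalWord(
        syllables=(
            _syllable(("k", "r"), "u", V.SHORT, "ŋ", T.MID, EC.MID, ST.LIVE),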
            _syllable("tʰ", "eː", V.LONG, "p̚", T.HIGH, EC.HIGH, ST.DEAD),
        ),
        morpheme_boundaries=(1,),
    ),
    # สิงหาคม — Indic month name, three syllables
    "สิงหาคม": PhonologicalWord(
        syllables=(
            # สิง and หา have high-class onsets (ส, ห) and rising tone; คม has a low-class onset (ค).
            _syllable("s", "i", V.SHORT, "ŋ", T.RISING, EC.HIGH, ST.LIVE),
            _syllable("h", "aː", V.LONG, None, T.RISING, EC.HIGH, ST.LIVE),
            _syllable("kʰ", "o", V.SHORT, "m", T.MID, EC.LOW, ST.LIVE),
        ),
    ),
}

register_lexicon(lambda w: SITE_OVERRIDES.get(w), name="site-vocab")

# Override resolves first.
result = analyze("กรุงเทพ")
print(result.source)                        # 'override:site-vocab'
print(transcribe("กรุงเทพ", scheme="ipa"))  # '/kruŋ˧.tʰeːp̚˥/'

# Non-override words go through the normal pipeline.
print(transcribe("สวัสดี", scheme="ipa"))   # '/sa˨˩.wat̚˨˩.diː˧/'

Example 2: SQLite-backed override store¶

For larger override collections, a SQLite file scales without occupying process memory up front.

Schema:

-- One row per Thai word.
CREATE TABLE overrides (
    thai_word TEXT PRIMARY KEY,
    payload   TEXT NOT NULL       -- JSON-serialised PhonologicalWord
) WITHOUT ROWID;

The payload column holds a serialised PhonologicalWord. The serialisation format is your choice. JSON with a custom encoder is the most portable option — human-readable and inspectable in any SQLite browser. A bespoke column-per-field layout gives full SQL queryability over individual phoneme fields at the cost of a more complex schema.

The lookup function:

import json
import sqlite3
import threading
from thaiphon import register_lexicon
from thaiphon.model.word import PhonologicalWord

DB_PATH = "/var/data/my-app/overrides.db"

_local = threading.local()

def _conn() -> sqlite3.Connection:
    if not hasattr(_local, "conn"):
        _local.conn = sqlite3.connect(
            f"file:{DB_PATH}?mode=ro&immutable=1",
            uri=True,
            check_same_thread=False,
        )
    return _local.conn

def _from_json(data: str) -> PhonologicalWord:
    # Implement your own deserialiser that matches how you serialised.
    ...

def _lookup(word: str) -> PhonologicalWord | None:
    row = _conn().execute(
        "SELECT payload FROM overrides WHERE thai_word = ?", (word,)
    ).fetchone()
    if row is None:
        return None
    return _from_json(row[0])

register_lexicon(_lookup, name="sqlite-overrides")

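The threading.local() pattern gives each thread its own connection, opened on first use, so lookups need no locking. This is the same pattern used by thaiphon-data-volubilis. Note that immutable=1 tells SQLite the file will never change, which disables locking and change detection; rebuild or swap the database file only while no reader has it open.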

Populating the database:

import sqlite3
from thaiphon.model.word import PhonologicalWord

def _to_json(word: PhonologicalWord) -> str:
    # Implement your own serialiser.
    ...

conn = sqlite3.connect("/var/data/my-app/overrides.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS overrides (
        thai_word TEXT PRIMARY KEY,
        payload   TEXT NOT NULL
    ) WITHOUT ROWID
""")
# `bangkok` is the PhonologicalWord built in the Quick start.
conn.execute(
    "INSERT OR REPLACE INTO overrides VALUES (?, ?)",
    ("กรุงเทพ", _to_json(bangkok)),
)
conn.commit()
conn.close()
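
The _to_json and _from_json stubs are deliberately left open. One possible shape is sketched below: enums are stored by member name and phonemes as small dicts. To keep it runnable without thaiphon installed, the sketch defines trimmed stand-in classes that mirror only some of the fields documented on this page (no Cluster onsets, tone_mark, effective_class, or syllable_type); the enum values are placeholders. In real code you would import the thaiphon classes and extend the two functions in the same way.

```python
import json
from dataclasses import dataclass
from enum import Enum
from typing import Optional


# Stand-ins for the thaiphon model classes (assumed shapes, trimmed).
class VowelLength(Enum):
    SHORT = "short"
    LONG = "long"


class Tone(Enum):
    MID = "mid"
    LOW = "low"
    FALLING = "falling"
    HIGH = "high"
    RISING = "rising"


@dataclass(frozen=True)
class Phoneme:
    symbol: str
    is_aspirated: bool = False
    is_sonorant: bool = False


@dataclass(frozen=True)
class Syllable:
    onset: Optional[Phoneme]
    vowel: Phoneme
    vowel_length: VowelLength
    coda: Optional[Phoneme]
    tone: Tone


@dataclass(frozen=True)
class PhonologicalWord:
    syllables: tuple
    morpheme_boundaries: tuple = ()


def _phoneme_to_dict(p):
    if p is None:
        return None
    return {"symbol": p.symbol, "is_aspirated": p.is_aspirated,
            "is_sonorant": p.is_sonorant}


def _phoneme_from_dict(d):
    return None if d is None else Phoneme(**d)


def to_json(word: PhonologicalWord) -> str:
    """Serialise a word; enums are written as their member names."""
    return json.dumps({
        "syllables": [
            {
                "onset": _phoneme_to_dict(s.onset),
                "vowel": _phoneme_to_dict(s.vowel),
                "vowel_length": s.vowel_length.name,
                "coda": _phoneme_to_dict(s.coda),
                "tone": s.tone.name,
            }
            for s in word.syllables
        ],
        "morpheme_boundaries": list(word.morpheme_boundaries),
    }, ensure_ascii=False)  # keep IPA readable in a SQLite browser


def from_json(payload: str) -> PhonologicalWord:
    """Inverse of to_json."""
    data = json.loads(payload)
    return PhonologicalWord(
        syllables=tuple(
            Syllable(
                onset=_phoneme_from_dict(s["onset"]),
                vowel=_phoneme_from_dict(s["vowel"]),
                vowel_length=VowelLength[s["vowel_length"]],
                coda=_phoneme_from_dict(s["coda"]),
                tone=Tone[s["tone"]],
            )
            for s in data["syllables"]
        ),
        morpheme_boundaries=tuple(data["morpheme_boundaries"]),
    )
```

Because the stand-ins are frozen dataclasses, `from_json(to_json(word)) == word` is a cheap round-trip check to run over every entry before writing the database.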

Example 3: priority-ordered layers¶

When you have multiple override sources — for example, a site-specific dictionary at low priority and a per-session user correction at high priority — register them separately and let the priority argument control resolution order.

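from thaiphon import register_lexicon, registered_lexicons, transcribe

# Both stores map a Thai word to a PhonologicalWord; fill them as in Example 1.
SITE_VOCAB = {}
SESSION_VOCAB = {}

# Site-wide vocabulary: lower priority (default 0).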
register_lexicon(lambda w: SITE_VOCAB.get(w), name="site-vocab", priority=0)

# Per-session user corrections — higher priority.
register_lexicon(lambda w: SESSION_VOCAB.get(w), name="session-corrections", priority=10)

# Resolution order: highest priority first.
print(registered_lexicons())
# ('session-corrections', 'site-vocab')

# If SESSION_VOCAB has an entry for a word, it wins over SITE_VOCAB.
# If SESSION_VOCAB returns None, SITE_VOCAB is tried next.
# If both return None, the built-in pipeline runs as normal.

registered_lexicons() always returns names in resolution order, so you can confirm the stack at runtime.
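
The resolution rule can be modelled in a few lines, which is handy for unit-testing your own layering logic without the library. This is a toy model of the behaviour described on this page, not thaiphon's implementation; `resolve`, the layer tuples, and the `fallback` argument are all hypothetical.

```python
def resolve(word, layers, fallback):
    """Return (source, result) for a word.

    layers is a list of (priority, name, lookup) tuples. The lookup with
    the highest priority is tried first; the first non-None result wins.
    fallback stands in for the built-in pipeline.
    """
    for _priority, name, lookup in sorted(layers, key=lambda layer: -layer[0]):
        hit = lookup(word)
        if hit is not None:
            return f"override:{name}", hit
    return "pipeline", fallback(word)


# Plain strings stand in for PhonologicalWord values.
site_vocab = {"ก": "site entry", "ข": "site entry"}
session_vocab = {"ก": "session entry"}

layers = [
    (0, "site-vocab", site_vocab.get),
    (10, "session-corrections", session_vocab.get),
]


def derive(word):
    return f"derived:{word}"


print(resolve("ก", layers, derive))  # ('override:session-corrections', 'session entry')
print(resolve("ข", layers, derive))  # ('override:site-vocab', 'site entry')
print(resolve("ค", layers, derive))  # ('pipeline', 'derived:ค')
```

How thaiphon orders two layers with equal priority is not stated here, so the model should not be relied on for ties.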

To remove a layer — for example, when a session ends:

from thaiphon import unregister_lexicon

unregister_lexicon("session-corrections")  # returns True if found, False if not
print(registered_lexicons())
# ('site-vocab',)

Timing of registration¶

Register your lexicons before the first transcribe or analyze call that should use them. Module-level registration (at import time) is the safest pattern for application code:

# my_app/thaiphon_setup.py — import this module early in your application startup.

from thaiphon import register_lexicon
from my_app.overrides import SITE_VOCAB

register_lexicon(lambda w: SITE_VOCAB.get(w), name="site-vocab")

A guard against duplicate registration is good practice if the module might be imported more than once:

from thaiphon import register_lexicon, registered_lexicons
from my_app.overrides import SITE_VOCAB

if "site-vocab" not in registered_lexicons():
    register_lexicon(lambda w: SITE_VOCAB.get(w), name="site-vocab")

Thread safety¶

The override registry is a module-level singleton. register_lexicon and unregister_lexicon mutate it and are not safe to call concurrently from multiple threads. The intended pattern is to register at startup, while the process is still single-threaded, and leave the registry unchanged during request handling, when it is only read and is safe from any thread. Dynamic registration during concurrent request handling requires external synchronisation.

Lookup itself — the path through your callable — is governed by your own callable's thread safety contract. A dict.get is safe; a SQLite connection with threading.local() (as shown above) is safe; an LRU cache wrapping a pure function is safe.
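
As an illustration of the LRU case, the sketch below wraps a slow lookup in functools.lru_cache. The backing dict and the function names are hypothetical stand-ins for whatever store you use. Note that a miss (None) is cached as well, so entries added to the store later stay invisible until the cache is cleared.

```python
from functools import lru_cache

# Hypothetical backing store; in practice this might be a database or HTTP call.
BACKEND = {"กรุงเทพ": "stored entry"}


def slow_lookup(word):
    """Stand-in for an expensive lookup. Returns None on a miss."""
    return BACKEND.get(word)


@lru_cache(maxsize=4096)
def cached_lookup(word):
    # Results, including None for misses, are memoised per word.
    return slow_lookup(word)


cached_lookup("กรุงเทพ")   # first call goes to the backend
cached_lookup("กรุงเทพ")   # second call is served from the cache
print(cached_lookup.cache_info().hits)   # 1

# After changing the store, drop stale results explicitly:
# cached_lookup.cache_clear()
```

`cached_lookup` has the same one-argument shape as the lambdas used earlier, so it could be passed to register_lexicon in their place.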

API reference¶

See Override lexicons — API for the full signatures of register_lexicon, unregister_lexicon, and registered_lexicons.