Override lexicons¶
thaiphon derives Thai phonology from orthographic rules and a built-in vocabulary. For most words the rule-based output is correct. But some words resist derivation: ceremonial compounds whose morpheme boundaries the segmenter cannot infer, domain-specific terms where your site holds a more authoritative form, or entries from corpora you cannot distribute upstream. Override lexicons are the escape hatch.
The idea in one paragraph¶
You register a Python callable. When thaiphon processes a Thai word, it calls your callable first — before consulting any built-in lexicon and before running the derivation rules. If your callable returns a PhonologicalWord, that result goes straight to the renderer; the rest of the pipeline is skipped. If it returns None, processing continues normally. Multiple callables can be stacked in priority order.
The library provides only the hook. Storage — an in-memory dict, a SQLite file, an HTTP call — is entirely your choice.
Why this exists¶
Pathological compounds¶
Thai Sanskrit-Pali compounds — ceremonial, religious, and legal vocabulary — sometimes have morpheme boundaries that only lexical knowledge can resolve. The full ceremonial name of Bangkok contains eleven morpheme boundaries that the rule-based segmenter cannot correctly infer from orthography alone. Rule-based derivation of such words produces a plausible but incorrect syllabification. A single hand-curated override entry fixes it immediately, without waiting for a library release.
Domain-specific correctness¶
A site running a Thai learner dictionary, a place-name gazetteer, a legal-terms database, or a brand-name lookup may have authoritative phonological forms that disagree with thaiphon's output for specific words. The override hook lets those sites inject their ground truth as a drop-in, with no changes to the engine.
Private corpora¶
Some authoritative phonological resources are licensed such that a consumer can use derived forms internally but cannot redistribute them. Override lexicons let consumers benefit from such data without contributing it to thaiphon's built-in vocabulary.
Where overrides sit in the pipeline¶
Thai text
│
▼
NFC normalisation + Sara-Am expansion
│
▼
Override lexicons ← your callable is called here
│ hit: return PhonologicalWord, skip everything below
│ miss: continue
▼
Built-in lexicons (VOLUBILIS exact, Indic learned, calendar, …)
│
▼
Syllabification + rule-based derivation
│
▼
PhonologicalWord
│
▼
Renderer → IPA / TLC / Morev / RTGS / Paiboon / LMT
Your callable receives the word in its post-normalisation Thai form — NFC-normalised, Sara-Am expanded, mark order canonicalised. You do not need to replicate thaiphon's normalisation pass.
One override feeds every renderer automatically. Register a PhonologicalWord once and IPA, TLC, Morev, RTGS, Paiboon, and LMT all render from the same structure.
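Because the callable sees the post-normalisation form, a dict-backed store only matches when its keys are in that same form. The following is a minimal sketch, assuming your source data may contain combining marks typed in non-canonical order. It NFC-normalises keys at load time using the standard library. `build_override_table` is a hypothetical helper, and the string values are placeholders for PhonologicalWord objects. The sketch does not reproduce thaiphon's Sara-Am expansion, so keys containing Sara Am should be checked against what your callable actually receives.

```python
import unicodedata


def build_override_table(raw_entries):
    """Return a copy of raw_entries whose keys are NFC-normalised.

    Hypothetical helper, not part of thaiphon. NFC puts combining marks
    into canonical order; it does NOT expand Sara Am.
    """
    return {
        unicodedata.normalize("NFC", key): value
        for key, value in raw_entries.items()
    }


# Key typed with the tone mark (mai ek, U+0E48) before the vowel sign
# (sara u, U+0E38), which is not the canonical order.
raw = {"\u0E01\u0E48\u0E38": "placeholder-entry"}
table = build_override_table(raw)

# In canonical order the vowel sign precedes the tone mark.
canonical_key = "\u0E01\u0E38\u0E48"
print(canonical_key in table)
```

Normalising once at load time keeps the per-word lookup a plain `dict.get`.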
Quick start¶
from thaiphon import register_lexicon, transcribe
from thaiphon.model.word import PhonologicalWord
from thaiphon.model.syllable import Syllable
from thaiphon.model.phoneme import Phoneme
from thaiphon.model.enums import VowelLength, Tone, ToneMark, EffectiveClass, SyllableType
# Build a PhonologicalWord for กรุงเทพ (Bangkok).
# The pipeline's default segmentation often treats this as two syllables;
# an override locks in the form you want.
bangkok = PhonologicalWord(
    syllables=(
        Syllable(
            onset=Phoneme("k"),
            vowel=Phoneme("u"),
            vowel_length=VowelLength.SHORT,
            coda=Phoneme("ŋ"),
            tone=Tone.MID,
            tone_mark=ToneMark.NONE,
            effective_class=EffectiveClass.MID,
            syllable_type=SyllableType.LIVE,
        ),
        Syllable(
            onset=Phoneme("tʰ", is_aspirated=True),
            vowel=Phoneme("eː"),
            vowel_length=VowelLength.LONG,
            coda=Phoneme("p̚"),
            tone=Tone.HIGH,
            tone_mark=ToneMark.NONE,
            effective_class=EffectiveClass.HIGH,
            syllable_type=SyllableType.DEAD,
        ),
    ),
)
OVERRIDES = {"กรุงเทพ": bangkok}
register_lexicon(lambda w: OVERRIDES.get(w), name="my-site")
print(transcribe("กรุงเทพ", scheme="ipa"))
# '/kruŋ˧.tʰeːp̚˥/'
print(transcribe("กรุงเทพ", scheme="tlc"))
# 'groong{M} thaep{H}'
The source field of the returned AnalysisResult is set to 'override:my-site', so you can tell which layer answered. Example 1 below reads it through analyze(...).source.
Building a PhonologicalWord¶
The main thing you need to know is how to construct the data structure. Every syllable is a Syllable frozen dataclass; every phoneme is a Phoneme (or Cluster for two-consonant onsets). The fields map directly to the phonological concepts described in the phonological model.
Phoneme¶
from thaiphon.model.phoneme import Phoneme, Cluster
Phoneme(symbol: str, is_aspirated: bool = False, is_sonorant: bool = False)
Use IPA symbols for symbol. A few conventions:
| What you want | Symbol |
|---|---|
| Aspirated onset | Phoneme("kʰ", is_aspirated=True) |
| Sonorant onset | Phoneme("n", is_sonorant=True) |
| Long vowel | Phoneme("aː") |
| Short vowel | Phoneme("a") |
| Unreleased stop coda | Phoneme("p̚") or Phoneme("t̚") or Phoneme("k̚") |
| Sonorant coda | Phoneme("m"), Phoneme("n"), Phoneme("ŋ") |
| Glide coda | Phoneme("j"), Phoneme("w") |
For a two-consonant onset cluster such as /pl/ (ปลา), pass a Cluster as the onset in place of a single Phoneme.
Syllable¶
from thaiphon.model.syllable import Syllable
from thaiphon.model.enums import VowelLength, Tone, ToneMark, EffectiveClass, SyllableType
Syllable(
    onset,            # Phoneme | Cluster | None
    vowel,            # Phoneme
    vowel_length,     # VowelLength.SHORT | VowelLength.LONG
    coda,             # Phoneme | None (None for open syllables)
    tone,             # Tone.MID | .LOW | .FALLING | .HIGH | .RISING
    tone_mark,        # ToneMark.NONE | .MAI_EK | .MAI_THO | .MAI_TRI | .MAI_JATTAWA
    effective_class,  # EffectiveClass.HIGH | .MID | .LOW
    syllable_type,    # SyllableType.LIVE | SyllableType.DEAD
)
All four enumeration arguments have defaults, so you only need to pass the ones that matter:
# Open syllable with default effective_class and syllable_type
Syllable(
    onset=Phoneme("k"),
    vowel=Phoneme("aː"),
    vowel_length=VowelLength.LONG,
    coda=None,
    tone=Tone.MID,
)
tone is the actual surface tone — one of the five Thai tones. This is what renderers use. tone_mark records which diacritic was written in the orthography, if any. For override entries where you already know the surface tone, set tone_mark=ToneMark.NONE unless you specifically want to record the orthographic mark.
effective_class is the consonant class used when the tone was derived. For override entries, set it to the class that corresponds to the onset consonant: high-class consonants, and low-class consonants that a leading ห makes behave as high-class, use EffectiveClass.HIGH; unmarked low-class sonorants use EffectiveClass.LOW; bare mid-class consonants use EffectiveClass.MID. Renderers do not inspect this field, but it is carried through for diagnostics. When in doubt, EffectiveClass.MID is a safe placeholder.
syllable_type is DEAD for syllables ending in an unreleased stop (p̚, t̚, k̚) or a short vowel with no coda; LIVE for everything else.
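The live/dead rule above can be written as a small function, which may help when generating many override entries from tabular data. This is a sketch only: `classify_syllable` is a hypothetical helper, and the `SyllableType` enum here is a stand-in so the block runs without thaiphon installed. In real code you would return the members of thaiphon.model.enums.SyllableType instead.

```python
from enum import Enum


class SyllableType(Enum):
    """Stand-in for thaiphon.model.enums.SyllableType (assumption)."""
    LIVE = "live"
    DEAD = "dead"


# Unreleased stop codas: base letter followed by U+031A (combining left angle above).
UNRELEASED_STOPS = {"p\u031A", "t\u031A", "k\u031A"}


def classify_syllable(vowel_is_short, coda_symbol):
    """Apply the rule stated above; coda_symbol is None for open syllables."""
    if coda_symbol in UNRELEASED_STOPS:
        return SyllableType.DEAD
    if coda_symbol is None and vowel_is_short:
        return SyllableType.DEAD
    return SyllableType.LIVE


print(classify_syllable(vowel_is_short=False, coda_symbol="p\u031A"))
print(classify_syllable(vowel_is_short=False, coda_symbol=None))
```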
PhonologicalWord¶
from thaiphon.model.word import PhonologicalWord
PhonologicalWord(
    syllables: tuple[Syllable, ...],
    morpheme_boundaries: tuple[int, ...] = (),  # syllable indices where morpheme splits occur
    confidence: float = 1.0,
    source: str = "derivation",  # overridden by the pipeline to 'override:<name>'
    raw: str = "",               # overridden by the pipeline from the input text
)
The pipeline overwrites source and raw when it serves an override result, so you can leave them at their defaults.
morpheme_boundaries records where a compound word splits at the morpheme level. For a compound such as กรุงเทพ (two morphemes: กรุง and เทพ), pass morpheme_boundaries=(1,), meaning the second morpheme starts at syllable index 1, so the boundary falls between syllables 0 and 1. Renderers can read this information, though it is currently used only for diagnostic purposes.
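To make the indexing concrete, here is a sketch that groups syllables by morpheme. It assumes each index in morpheme_boundaries marks the syllable at which a new morpheme begins, which is the reading under which (1,) splits a two-syllable word in two. `split_by_morpheme` is a hypothetical helper and plain strings stand in for Syllable objects.

```python
def split_by_morpheme(syllables, morpheme_boundaries):
    """Group syllables into morphemes (hypothetical helper, not thaiphon API).

    Assumes every boundary index is the position where a new morpheme starts.
    """
    groups = []
    start = 0
    for index in morpheme_boundaries:
        groups.append(list(syllables[start:index]))
        start = index
    groups.append(list(syllables[start:]))
    return groups


# Two placeholder syllables with one boundary: one group per morpheme.
print(split_by_morpheme(["krung", "thep"], (1,)))
```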
Worked examples¶
Example 1: in-memory dict¶
The simplest possible override: a plain Python dict.
from thaiphon import register_lexicon, transcribe, analyze
from thaiphon.model.word import PhonologicalWord
from thaiphon.model.syllable import Syllable
from thaiphon.model.phoneme import Phoneme
from thaiphon.model.enums import VowelLength, Tone, EffectiveClass, SyllableType
def _syllable(onset_sym, vowel_sym, length, coda_sym, tone, ec, st):
    return Syllable(
        onset=Phoneme(onset_sym) if onset_sym else None,
        vowel=Phoneme(vowel_sym),
        vowel_length=length,
        coda=Phoneme(coda_sym) if coda_sym else None,
        tone=tone,
        effective_class=ec,
        syllable_type=st,
    )
V = VowelLength
T = Tone
EC = EffectiveClass
ST = SyllableType
SITE_OVERRIDES: dict[str, PhonologicalWord] = {
    # กรุงเทพ: two-syllable compound
    "กรุงเทพ": PhonologicalWord(
        syllables=(
            _syllable("k", "u", V.SHORT, "ŋ", T.MID, EC.MID, ST.LIVE),
            _syllable("tʰ", "eː", V.LONG, "p̚", T.HIGH, EC.HIGH, ST.DEAD),
        ),
        morpheme_boundaries=(1,),
    ),
    # สิงหาคม: Indic month name, three syllables.
    # ส and ห are high-class, so the first two live syllables carry the
    # rising tone; ค is low-class, so the last live syllable is mid.
    "สิงหาคม": PhonologicalWord(
        syllables=(
            _syllable("s", "i", V.SHORT, "ŋ", T.RISING, EC.HIGH, ST.LIVE),
            _syllable("h", "aː", V.LONG, None, T.RISING, EC.HIGH, ST.LIVE),
            _syllable("kʰ", "o", V.SHORT, "m", T.MID, EC.LOW, ST.LIVE),
        ),
    ),
}
register_lexicon(lambda w: SITE_OVERRIDES.get(w), name="site-vocab")
# Override resolves first.
result = analyze("กรุงเทพ")
print(result.source) # 'override:site-vocab'
print(transcribe("กรุงเทพ", scheme="ipa")) # '/kruŋ˧.tʰeːp̚˥/'
# Non-override words go through the normal pipeline.
print(transcribe("สวัสดี", scheme="ipa")) # '/sa˨˩.wat̚˨˩.diː˧/'
Example 2: SQLite-backed override store¶
For larger override collections, a SQLite file scales without occupying process memory up front.
Schema:
-- One row per Thai word.
CREATE TABLE overrides (
    thai_word TEXT PRIMARY KEY,
    payload   TEXT NOT NULL  -- JSON-serialised PhonologicalWord
) WITHOUT ROWID;
The payload column holds a serialised PhonologicalWord. The serialisation format is your choice. JSON with a custom encoder is the most portable option — human-readable and inspectable in any SQLite browser. A bespoke column-per-field layout gives full SQL queryability over individual phoneme fields at the cost of a more complex schema.
The lookup function:
import json
import sqlite3
import threading
from thaiphon import register_lexicon
from thaiphon.model.word import PhonologicalWord
DB_PATH = "/var/data/my-app/overrides.db"
_local = threading.local()
def _conn() -> sqlite3.Connection:
    # Open one read-only connection per thread, on first use.
    if not hasattr(_local, "conn"):
        _local.conn = sqlite3.connect(
            f"file:{DB_PATH}?mode=ro&immutable=1",
            uri=True,
            check_same_thread=False,
        )
    return _local.conn

def _from_json(data: str) -> PhonologicalWord:
    # Implement your own deserialiser that matches how you serialised.
    ...

def _lookup(word: str) -> PhonologicalWord | None:
    row = _conn().execute(
        "SELECT payload FROM overrides WHERE thai_word = ?", (word,)
    ).fetchone()
    if row is None:
        return None
    return _from_json(row[0])
register_lexicon(_lookup, name="sqlite-overrides")
The threading.local() connection pattern means each thread opens its own connection on first use, with no locking overhead. This is the same pattern used by thaiphon-data-volubilis.
Populating the database:
import sqlite3
from thaiphon.model.word import PhonologicalWord
def _to_json(word: PhonologicalWord) -> str:
    # Implement your own serialiser.
    ...

conn = sqlite3.connect("/var/data/my-app/overrides.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS overrides (
        thai_word TEXT PRIMARY KEY,
        payload   TEXT NOT NULL
    ) WITHOUT ROWID
""")
# `bangkok` is the PhonologicalWord built in the Quick start.
conn.execute(
    "INSERT OR REPLACE INTO overrides VALUES (?, ?)",
    ("กรุงเทพ", _to_json(bangkok)),
)
conn.commit()
conn.close()
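The serialiser and deserialiser are left open above. One possible shape for a JSON round trip is sketched here, under loud assumptions: the dataclasses and enums below are stand-ins that mirror only a subset of the documented fields so the block runs without thaiphon installed, Cluster onsets are not handled, and `to_json` / `from_json` are illustrative names. A real implementation would import the thaiphon classes and cover every field you rely on.

```python
import json
from dataclasses import asdict, dataclass
from enum import Enum
from typing import Optional


# Stand-ins (assumption): a subset of the fields documented on this page.
class VowelLength(Enum):
    SHORT = "short"
    LONG = "long"


class Tone(Enum):
    MID = "mid"
    LOW = "low"
    FALLING = "falling"
    HIGH = "high"
    RISING = "rising"


@dataclass(frozen=True)
class Phoneme:
    symbol: str
    is_aspirated: bool = False
    is_sonorant: bool = False


@dataclass(frozen=True)
class Syllable:
    onset: Optional[Phoneme]
    vowel: Phoneme
    vowel_length: VowelLength
    coda: Optional[Phoneme]
    tone: Tone


@dataclass(frozen=True)
class PhonologicalWord:
    syllables: tuple
    morpheme_boundaries: tuple = ()


def _encode_enum(obj):
    # json.dumps calls this only for values it cannot serialise itself.
    if isinstance(obj, Enum):
        return obj.name
    raise TypeError(f"cannot serialise {type(obj).__name__}")


def to_json(word):
    # asdict() recurses into nested dataclasses; enums are stored by name.
    return json.dumps(asdict(word), default=_encode_enum, ensure_ascii=False)


def _phoneme(data):
    return None if data is None else Phoneme(**data)


def from_json(payload):
    raw = json.loads(payload)
    syllables = tuple(
        Syllable(
            onset=_phoneme(item["onset"]),
            vowel=_phoneme(item["vowel"]),
            vowel_length=VowelLength[item["vowel_length"]],
            coda=_phoneme(item["coda"]),
            tone=Tone[item["tone"]],
        )
        for item in raw["syllables"]
    )
    return PhonologicalWord(syllables, tuple(raw["morpheme_boundaries"]))


word = PhonologicalWord(
    syllables=(
        Syllable(Phoneme("k"), Phoneme("aː"), VowelLength.LONG, None, Tone.MID),
    ),
)
payload = to_json(word)
print(from_json(payload) == word)
```

Storing enum members by name rather than by value keeps the payload readable in a SQLite browser and independent of whatever values the real enums use.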
Example 3: priority-ordered layers¶
When you have multiple override sources — for example, a site-specific dictionary at low priority and a per-session user correction at high priority — register them separately and let the priority argument control resolution order.
from thaiphon import register_lexicon, registered_lexicons, transcribe
# Site-wide vocabulary — lower priority (default 0).
register_lexicon(lambda w: SITE_VOCAB.get(w), name="site-vocab", priority=0)
# Per-session user corrections — higher priority.
register_lexicon(lambda w: SESSION_VOCAB.get(w), name="session-corrections", priority=10)
# Resolution order: highest priority first.
print(registered_lexicons())
# ('session-corrections', 'site-vocab')
# If SESSION_VOCAB has an entry for a word, it wins over SITE_VOCAB.
# If SESSION_VOCAB returns None, SITE_VOCAB is tried next.
# If both return None, the built-in pipeline runs as normal.
registered_lexicons() always returns names in resolution order, so you can confirm the stack at runtime.
To remove a layer — for example, when a session ends:
from thaiphon import unregister_lexicon
unregister_lexicon("session-corrections") # returns True if found, False if not
print(registered_lexicons())
# ('site-vocab',)
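The resolution behaviour described in this example can be modelled in a few lines. This is a sketch of the documented behaviour, not thaiphon's implementation: `resolve` is a hypothetical function, the layers are plain tuples, the string values are placeholders for PhonologicalWord objects, and how the library orders layers of equal priority is not stated here, so the model simply keeps registration order for ties.

```python
def resolve(layers, word):
    """Model of override resolution (hypothetical, for illustration only).

    layers: list of (priority, name, lookup) tuples.
    Returns (source, result) for the first layer that answers, else None.
    """
    # Highest priority first; sorted() is stable, so ties keep list order.
    ordered = sorted(layers, key=lambda layer: layer[0], reverse=True)
    for _priority, name, lookup in ordered:
        result = lookup(word)
        if result is not None:
            return f"override:{name}", result
    return None  # the built-in pipeline would run at this point


site_vocab = {"a": "site-a", "b": "site-b"}
session_vocab = {"a": "session-a"}
layers = [
    (0, "site-vocab", site_vocab.get),
    (10, "session-corrections", session_vocab.get),
]

print(resolve(layers, "a"))
print(resolve(layers, "b"))
print(resolve(layers, "c"))
```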
Timing of registration¶
Register your lexicons before the first transcribe or analyze call that should use them. Module-level registration (at import time) is the safest pattern for application code:
# my_app/thaiphon_setup.py — import this module early in your application startup.
from thaiphon import register_lexicon
from my_app.overrides import SITE_VOCAB
register_lexicon(lambda w: SITE_VOCAB.get(w), name="site-vocab")
A guard against duplicate registration is good practice if the module might be imported more than once:
from thaiphon import register_lexicon, registered_lexicons
if "site-vocab" not in registered_lexicons():
register_lexicon(lambda w: SITE_VOCAB.get(w), name="site-vocab")
Thread safety¶
The override registry is a module-level singleton. register_lexicon and unregister_lexicon modify it, which is not thread-safe if called concurrently from multiple threads. The intended pattern is to register at startup (single-threaded) and leave the registry unchanged during request handling (read-only, safe from any thread). Dynamic registration during concurrent request handling requires external synchronisation.
Lookup itself — the path through your callable — is governed by your own callable's thread safety contract. A dict.get is safe; a SQLite connection with threading.local() (as shown above) is safe; an LRU cache wrapping a pure function is safe.
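As an illustration of the LRU-cache option, the sketch below wraps a lookup in `functools.lru_cache` from the standard library. The backing dict, the placeholder string value, and the call counter are assumptions for demonstration only; in practice the wrapped function would query your real store and return a PhonologicalWord or None.

```python
from functools import lru_cache

BACKING_STORE = {"กรุงเทพ": "placeholder-entry"}
backing_calls = []  # instrumentation only, to show when the store is hit


def slow_lookup(word):
    backing_calls.append(word)
    return BACKING_STORE.get(word)


@lru_cache(maxsize=4096)
def cached_lookup(word):
    # Results are cached per word, including None for misses.
    return slow_lookup(word)


cached_lookup("กรุงเทพ")
cached_lookup("กรุงเทพ")  # served from the cache
cached_lookup("ไม่มี")
cached_lookup("ไม่มี")    # the miss is cached as well
print(len(backing_calls))
```

Because misses are cached too, an entry added to the store later stays invisible until `cached_lookup.cache_clear()` is called, which is worth planning for if the override data can change at runtime.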
API reference¶
See Override lexicons — API for the full signatures of register_lexicon, unregister_lexicon, and registered_lexicons.