Architecture
thaiphon converts Thai text to phonetic notation through a linear pipeline of pure functions. Each stage is independently testable and makes no network calls or I/O.
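As a rough illustration of what "a linear pipeline of pure functions" means, the sketch below folds a value through a list of stage callables. The names run_pipeline, strip_whitespace, split_words and count_items are hypothetical stand-ins, not thaiphon functions; the real stages are described in the sections that follow.

```python
from functools import reduce
from typing import Any, Callable, Sequence

# Hypothetical type alias: a stage is any pure function of one argument.
Stage = Callable[[Any], Any]


def run_pipeline(value: Any, stages: Sequence[Stage]) -> Any:
    """Feed value through each stage in order; each output is the next input."""
    return reduce(lambda acc, stage: stage(acc), stages, value)


# Toy stages (assumptions for illustration only, unrelated to Thai phonology).
def strip_whitespace(text: str) -> str:
    return text.strip()


def split_words(text: str) -> list:
    return text.split()


def count_items(words: list) -> int:
    return len(words)


result = run_pipeline("  a b c ", [strip_whitespace, split_words, count_items])
```

Because every stage is a plain function with no I/O, each one can be called and tested on its own, which is the property the pipeline design relies on.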
The pipeline
transcribe() runs every stage of the pipeline in order.
Stage 1: Normalisation and expansion
Module: thaiphon.normalization
- unicode_norm.normalize applies NFC normalisation, strips Unicode variation selectors, and reorders Thai combining marks into canonical order (vowel marks before tone marks before killer marks). This guards against the mark-order variation that different input sources produce.
- expand.expand rewrites abbreviation and shorthand forms: Sara Am (◌ำ) decomposes into its constituent /aː/ vowel + /m/ coda, ๆ (mai yamok) repeats the preceding word, ฯลฯ expands to "and so on", and isolated Thai digits are spelled out as words.
The public API functions call normalize automatically; all profiles and schemes receive NFC input.
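The normalisation step can be approximated with the standard library alone. The following is a minimal sketch, not thaiphon's own code: the function name normalize_sketch, the exact set of code points treated as marks, and the choice to rank mai taikhu (U+0E47) with the vowel marks are all assumptions made for illustration.

```python
import re
import unicodedata

# Variation selectors: U+FE00..U+FE0F and the supplement U+E0100..U+E01EF.
VARIATION_SELECTORS = re.compile("[\uFE00-\uFE0F\U000E0100-\U000E01EF]")

# Runs of Thai combining marks (assumed set): above/below vowels,
# mai taikhu, the four tone marks, and thanthakhat (the "killer" mark).
THAI_MARK_RUN = re.compile("[\u0E31\u0E34-\u0E39\u0E47-\u0E4C]+")


def _mark_rank(ch: str) -> int:
    """Order used inside a run: vowel marks, then tone marks, then killer."""
    if "\u0E48" <= ch <= "\u0E4B":  # mai ek .. mai chattawa
        return 1
    if ch == "\u0E4C":  # thanthakhat
        return 2
    return 0  # vowel marks (and, by assumption, mai taikhu)


def normalize_sketch(text: str) -> str:
    text = unicodedata.normalize("NFC", text)
    text = VARIATION_SELECTORS.sub("", text)
    # sorted() is stable, so marks of equal rank keep their input order.
    return THAI_MARK_RUN.sub(
        lambda m: "".join(sorted(m.group(), key=_mark_rank)), text
    )


# Tone mark typed before the vowel mark: ko kai + mai tho + sara i.
reordered = normalize_sketch("\u0E01\u0E49\u0E34")
```

The explicit reordering is needed because NFC alone does not reorder every Thai mark pair: several above-base vowel marks have canonical combining class 0, so a tone mark typed before one of them stays where it is.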
Stage 2: Lexicon lookups
Modules: thaiphon.lexicons.*
Before running the derivation pipeline, the runner consults several lexicons:
- Indic learned lexicon — Sanskrit/Pali-derived words with register-dependent pronunciations. Takes highest precedence for words where colloquial vs. learned readings differ materially.
- Exact lexicon (VOLUBILIS / royal) — full-form word entries with pre-computed phonological words. When a word is found here, derivation is skipped entirely.
- Calendar lexicon — months, weekdays, abbreviations.
- Irregular readings and respelling lexicons — words whose syllabification or vowel pattern diverges from the standard rules. These are respelled into a form the derivation pipeline can handle.
- ทร and ฤ lexicons — the ambiguous digraphs described in Special cases.
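If a word is found in a lexicon that stores a complete reading, the runner returns the lexicon's answer and skips syllabification and derivation. The exception, as noted above, is a form respelled by the irregular readings and respelling lexicons, which continues into the derivation pipeline. Lexicon entries carry a source="lexicon" tag in the resulting AnalysisResult.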
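The precedence behaviour can be pictured as a first-match cascade over an ordered list of tables. This is a sketch under assumptions: LexiconHit, lookup and the placeholder entries are hypothetical, and only the position of the Indic learned lexicon at the front is taken from the description above; the relative order of the remaining lexicons is not specified here.

```python
from dataclasses import dataclass
from typing import Mapping, Optional, Sequence, Tuple


@dataclass(frozen=True)
class LexiconHit:
    """Hypothetical result record; the real type is AnalysisResult."""
    reading: str
    lexicon: str
    source: str = "lexicon"


def lookup(
    word: str, lexicons: Sequence[Tuple[str, Mapping[str, str]]]
) -> Optional[LexiconHit]:
    """Return the first hit in precedence order, or None to fall through."""
    for name, table in lexicons:
        reading = table.get(word)
        if reading is not None:
            return LexiconHit(reading=reading, lexicon=name)
    return None


# Placeholder keys and readings, not real lexicon data.
LEXICONS = [
    ("indic_learned", {"word_a": "learned reading of word_a"}),
    ("exact", {
        "word_a": "exact reading of word_a",
        "word_b": "exact reading of word_b",
    }),
]

hit = lookup("word_a", LEXICONS)   # found in both; the earlier table wins
miss = lookup("word_c", LEXICONS)  # None: continue to syllabification
```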
Stage 3: Syllabification
Modules: thaiphon.tokenization.tcc, thaiphon.syllabification
For words not in any lexicon:
- TCC tokenizer (tcc.tokenize) breaks the input into Thai Character Cluster units, the smallest units that cannot be split across a syllable boundary. TCC is a well-established Thai NLP primitive.
- Candidate generator (syllabification.generator.CandidateGenerator) takes the TCC chunks and produces candidate segmentations. It applies heuristics for common cluster patterns (C+r, C+l, C+w), for the leading-ห pattern, and for the aksornam pattern.
- Candidate ranker (syllabification.ranker.CandidateRanker) scores candidates and returns them in rank order. The pipeline uses the top-ranked candidate.
Each candidate is a SyllabificationCandidate with a list of orthographic segments (one per syllable) and a score.
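The generate-then-rank shape can be sketched as follows. Everything here is illustrative: the Candidate class, the brute-force enumeration of chunk merges, and toy_score are assumptions, and the real generator uses the targeted heuristics listed above rather than exhaustive enumeration. Latin placeholders stand in for TCC chunks.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, List, Sequence, Tuple

Segments = Tuple[str, ...]


@dataclass(frozen=True)
class Candidate:
    """Hypothetical stand-in for SyllabificationCandidate."""
    segments: Segments
    score: float


def generate(
    chunks: Sequence[str], score: Callable[[Segments], float]
) -> List[Candidate]:
    """Enumerate every way of merging adjacent chunks into segments."""
    if not chunks:
        return []
    candidates = []
    # One boolean per gap between chunks: True means "cut here".
    for cuts in product((False, True), repeat=len(chunks) - 1):
        segments = [chunks[0]]
        for chunk, cut in zip(chunks[1:], cuts):
            if cut:
                segments.append(chunk)
            else:
                segments[-1] += chunk
        segs = tuple(segments)
        candidates.append(Candidate(segs, score(segs)))
    return candidates


def rank(candidates: List[Candidate]) -> List[Candidate]:
    """Best score first."""
    return sorted(candidates, key=lambda c: c.score, reverse=True)


def toy_score(segments: Segments) -> float:
    # Toy heuristic (assumption): penalise segments that are not 2 chars long.
    return -float(sum(abs(len(s) - 2) for s in segments))


candidates = generate(["a", "b", "cd"], toy_score)
best = rank(candidates)[0]
```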
Stage 4: Rule-based derivation
Modules: thaiphon.derivation.*
Each candidate segment is passed to _derive_syllable, which runs six sub-derivations in sequence:
- Onset resolution (derivation.onset): identify the initial consonant(s), their class, and IPA phoneme.
- Final extraction: scan the post-onset characters for a trailing consonant (with thanthakhat handling).
- Vowel resolution (derivation.vowel): identify the vowel quality, length, and any offglide from the remaining characters.
- Coda resolution (derivation.coda): map the final consonant to its collapsed IPA coda phoneme.
- Syllable type classification (derivation.syllable_type): live vs. dead.
- Tone assignment (derivation.tone): look up in the tone matrix.
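The last two steps follow the standard textbook tone rules for Thai, which can be sketched in a few lines. This is an illustration only: the function names, the string labels for classes, marks and tones, and the dictionary layout are assumptions, and the actual tone_matrix table in thaiphon.tables may be organised differently.

```python
from typing import Optional

SONORANT_CODAS = {"m", "n", "ŋ", "j", "w"}


def is_live(coda: Optional[str], long_vowel: bool) -> bool:
    """Live: sonorant coda, or open syllable with a long vowel."""
    if coda is None:
        return long_vowel
    return coda in SONORANT_CODAS


# Tone marks override the live/dead distinction.
MARKED_TONES = {
    ("mid", "mai_ek"): "low",
    ("high", "mai_ek"): "low",
    ("low", "mai_ek"): "falling",
    ("mid", "mai_tho"): "falling",
    ("high", "mai_tho"): "falling",
    ("low", "mai_tho"): "high",
    ("mid", "mai_tri"): "high",
    ("mid", "mai_chattawa"): "rising",
}


def assign_tone(
    cls: str, mark: Optional[str], live: bool, long_vowel: bool
) -> str:
    """cls is the onset consonant class: 'mid', 'high' or 'low'."""
    if mark is not None:
        try:
            return MARKED_TONES[(cls, mark)]
        except KeyError:
            raise ValueError(f"no rule for {cls} class with {mark}") from None
    if live:
        return "rising" if cls == "high" else "mid"
    if cls == "low":
        # Dead low-class syllables split on vowel length.
        return "falling" if long_vowel else "high"
    return "low"
```

For example, a mid-class onset with a long vowel and an /n/ coda is live and unmarked, so it takes the mid tone, while a low-class onset in a short dead syllable takes the high tone.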
After all segments are derived, two post-processing steps apply:
- Aksornam propagation — bare leader consonants promote the following syllable's class.
- Length overrides — the length lexicon corrects vowel length for words where the standard derivation would get it wrong.
Stage 5: PhonologicalWord assembly
Module: thaiphon.model
The derived syllables are assembled into a PhonologicalWord — a frozen dataclass containing a tuple of Syllable objects, a confidence score, a source tag, and the raw input string.
This is the universal intermediate. It is scheme-independent: every output scheme receives exactly the same PhonologicalWord.
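A frozen dataclass of this shape might look like the sketch below. The field names (syllables, confidence, source, raw, and the Syllable fields) are guesses based on the description above, not the definitions in thaiphon.model.

```python
from dataclasses import FrozenInstanceError, dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Syllable:
    # Assumed fields; the real Syllable may carry more detail.
    onset: str
    vowel: str
    coda: Optional[str]
    tone: str


@dataclass(frozen=True)
class PhonologicalWord:
    syllables: Tuple[Syllable, ...]
    confidence: float
    source: str
    raw: str


word = PhonologicalWord(
    syllables=(Syllable("k", "aː", "n", "mid"),),
    confidence=1.0,
    source="lexicon",
    raw="กาน",
)

# Frozen instances reject attribute assignment.
try:
    word.confidence = 0.5
    mutated = True
except FrozenInstanceError:
    mutated = False
```

Keeping the syllables in a tuple rather than a list means the whole value is immutable and hashable, so the same object can be handed to any number of renderers without defensive copying.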
Stage 6: Rendering
Modules: thaiphon.renderers.*
A Renderer transforms a PhonologicalWord into a string. Each built-in scheme is implemented as a MappingRenderer driven by a SchemeMapping — a frozen dataclass of maps and a tone_format function.
Rendering walks the syllable tuple, looks up each onset/vowel/coda in the appropriate map, calls tone_format to add tone decoration, and joins syllables with the syllable_separator.
The RenderContext carries the profile name and output format ("text" or "html") so schemes can gate profile-sensitive decisions (e.g. foreign coda preservation) without additional API arguments.
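The mapping-driven rendering loop can be sketched as below. The names SchemeMapping, tone_format and syllable_separator come from the description above, but the field names onsets, vowels and codas, the tone_format signature, the render function and the toy scheme are assumptions; RenderContext is left out for brevity.

```python
from dataclasses import dataclass
from typing import Callable, Mapping, Optional, Sequence


@dataclass(frozen=True)
class Syllable:
    # Assumed fields, kept minimal for the sketch.
    onset: str
    vowel: str
    coda: Optional[str]
    tone: str


@dataclass(frozen=True)
class SchemeMapping:
    onsets: Mapping[str, str]
    vowels: Mapping[str, str]
    codas: Mapping[str, str]
    tone_format: Callable[[str, str], str]  # (syllable body, tone) -> text
    syllable_separator: str = "-"


def render(syllables: Sequence[Syllable], mapping: SchemeMapping) -> str:
    """Map each onset/vowel/coda, decorate with tone, join the syllables."""
    parts = []
    for syl in syllables:
        body = mapping.onsets[syl.onset] + mapping.vowels[syl.vowel]
        if syl.coda is not None:
            body += mapping.codas[syl.coda]
        parts.append(mapping.tone_format(body, syl.tone))
    return mapping.syllable_separator.join(parts)


# A made-up scheme that writes long vowels doubled and tones in brackets.
toy_scheme = SchemeMapping(
    onsets={"k": "k", "m": "m"},
    vowels={"aː": "aa", "a": "a"},
    codas={"n": "n"},
    tone_format=lambda body, tone: f"{body}[{tone}]",
)

text = render(
    (Syllable("k", "aː", "n", "mid"), Syllable("m", "a", None, "high")),
    toy_scheme,
)
```

Since a scheme is only data plus one formatting function, adding a scheme in this style means supplying new maps rather than writing new traversal code.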
Subpackage map
| Subpackage | What lives there |
|---|---|
| thaiphon.api | Public entry points: transcribe, analyze, list_schemes, etc. |
| thaiphon.model | Frozen dataclasses: Syllable, PhonologicalWord, AnalysisResult, Phoneme, Cluster; all enumerations |
| thaiphon.normalization | Unicode normalisation (unicode_norm) and text expansion (expand) |
| thaiphon.pipeline | PipelineRunner, which orchestrates all stages; RenderContext carrier |
| thaiphon.syllabification | CandidateGenerator, CandidateRanker, strategies |
| thaiphon.tokenization | tcc.tokenize: Thai Character Cluster tokenizer |
| thaiphon.derivation | onset, vowel, coda, syllable_type, tone: one module per derivation step |
| thaiphon.tables | Static lookup tables: consonants, final_collapse, tone_matrix, clusters, leaders |
| thaiphon.lexicons | Lexicon modules: exact, indic_learned, irregular, loanword, length_overrides, royal, silent_h, ror_ror, thor, rue, etc. |
| thaiphon.renderers | base (protocol + RenderContext), mapping (SchemeMapping, MappingRenderer), ipa, tlc, morev |
| thaiphon.segmentation | longest.segment: dictionary-based longest-match Thai word segmenter |
| thaiphon.registry | Registry generic, RENDERERS singleton |
| thaiphon.errors | Exception hierarchy |
Where to look for what
- Adding a new scheme: thaiphon/renderers/ and thaiphon/registry.py. See Write your own scheme.
- Fixing a derivation rule: thaiphon/derivation/ and thaiphon/tables/.
- Adding a lexicon entry: thaiphon/lexicons/. Each file is a plain Python dict or frozenset.
- Changing syllabification heuristics: thaiphon/syllabification/generator.py and thaiphon/syllabification/strategies.py.
- Changing normalisation: thaiphon/normalization/.
- Changing the public API signature: thaiphon/api.py.