Architecture¶

thaiphon converts Thai text to phonetic notation through a linear pipeline of pure functions. Each stage is independently testable and makes no network calls or I/O.

The pipeline¶

Figure: The thaiphon pipeline. Thai text enters at the top and passes through six numbered stages: normalisation and expansion, lexicon lookups (with a hit shortcut that bypasses later stages), syllabification, rule-based derivation, the phonological-word data structure, and the renderer. The output is three parallel surface strings in IPA, TLC, and Morev notation.

Six stages. One immutable phonological model. A single call to `transcribe()` walks the whole procession.

Stage 1: Normalisation and expansion¶

Module: thaiphon.normalization

unicode_norm.normalize applies NFC normalisation, strips Unicode variation selectors, and reorders Thai combining marks into canonical order (vowel marks before tone marks before killer marks). This guards against mark-order variation that different input sources produce.
expand.expand rewrites abbreviation and shorthand forms: Sara Am (◌ำ) decomposes into its constituent /aː/ vowel + /m/ coda, ๆ (mai yamok) repeats the preceding word, ฯลฯ expands to "and so on", and isolated Thai digits are spelled out as words.

The public API functions call normalize automatically; all profiles and schemes receive NFC input.
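As an illustration of what this stage does, here is a minimal standard-library sketch. It is not the code in thaiphon.normalization: the names `normalize_sketch` and `expand_tokens_sketch`, the mark classes, and the assumption that input arrives as pre-split tokens for expansion are all hypothetical.

```python
import unicodedata

# Variation selectors VS1..VS16 (U+FE00..U+FE0F).
VARIATION_SELECTORS = {chr(cp) for cp in range(0xFE00, 0xFE10)}

# Assumed mark classes for this sketch; the library's own tables may differ.
VOWEL_MARKS = set("\u0E31\u0E34\u0E35\u0E36\u0E37\u0E38\u0E39")  # above/below vowels
TONE_MARKS = set("\u0E48\u0E49\u0E4A\u0E4B")                      # mai ek .. mai chattawa
KILLER_MARKS = {"\u0E4C"}                                          # thanthakhat

# Thai digits spelled out as words.
THAI_DIGIT_WORDS = {
    "๐": "ศูนย์", "๑": "หนึ่ง", "๒": "สอง", "๓": "สาม", "๔": "สี่",
    "๕": "ห้า", "๖": "หก", "๗": "เจ็ด", "๘": "แปด", "๙": "เก้า",
}


def mark_rank(ch):
    """Return the sort rank of a combining mark, or None for other characters."""
    if ch in VOWEL_MARKS:
        return 0
    if ch in TONE_MARKS:
        return 1
    if ch in KILLER_MARKS:
        return 2
    return None


def normalize_sketch(text):
    """NFC, drop variation selectors, then order each run of marks."""
    text = unicodedata.normalize("NFC", text)
    text = "".join(ch for ch in text if ch not in VARIATION_SELECTORS)
    out, run = [], []
    for ch in text:
        if mark_rank(ch) is not None:
            run.append(ch)  # collect consecutive combining marks
        else:
            out.extend(sorted(run, key=mark_rank))  # stable sort keeps ties in input order
            run = []
            out.append(ch)
    out.extend(sorted(run, key=mark_rank))
    return "".join(out)


def expand_tokens_sketch(tokens):
    """Repeat the previous token on mai yamok and spell out isolated Thai digits."""
    out = []
    for tok in tokens:
        if tok == "ๆ" and out:
            out.append(out[-1])
        else:
            out.append(THAI_DIGIT_WORDS.get(tok, tok))
    return out
```

With this sketch, a tone mark typed before a vowel mark is swapped into vowel-then-tone order, so two visually identical inputs compare equal after normalisation.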

Stage 2: Lexicon lookups¶

Modules: thaiphon.overrides, thaiphon.lexicons.*

Before running the derivation pipeline, the runner consults any override lexicons the caller has registered via register_lexicon. If a registered lookup returns a PhonologicalWord, the pipeline skips all subsequent stages for that word. See Override lexicons for how to register overrides.

After the override pass, the runner consults several built-in lexicons:

Indic learned lexicon — Sanskrit/Pali-derived words with register-dependent pronunciations. Takes highest precedence for words where colloquial vs. learned readings differ materially.
Exact lexicon (VOLUBILIS / royal) — full-form word entries with pre-computed phonological words. When a word is found here, derivation is skipped entirely.
Calendar lexicon — months, weekdays, abbreviations.
Irregular readings and respelling lexicons — words whose syllabification or vowel pattern diverges from the standard rules. These are respelled into a form the derivation pipeline can handle.
ทร and ฤ lexicons — the ambiguous digraphs described in Special cases.

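If a word is found in a lexicon that stores a finished reading, the runner returns the lexicon's answer and skips syllabification and derivation. The respelling lexicons are the exception: they hand a rewritten spelling on to the later stages. Lexicon entries carry a source="lexicon" tag in the resulting AnalysisResult.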
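The precedence order described in this stage amounts to a first-hit-wins chain. The sketch below shows that idea only; `first_hit`, the chain layout, and the placeholder readings are hypothetical and do not reflect the real register_lexicon signature or the contents of any thaiphon lexicon.

```python
from typing import Callable, Optional, Sequence, Tuple

# A lookup takes a word and returns a reading, or None on a miss.
Lookup = Callable[[str], Optional[str]]


def first_hit(word: str, chain: Sequence[Tuple[str, Lookup]]):
    """Return (lexicon name, reading) from the first lookup that hits, else None."""
    for name, lookup in chain:
        reading = lookup(word)
        if reading is not None:
            return name, reading  # hit shortcut: later lookups are never consulted
    return None


# Placeholder data: the readings are dummy strings, not real transcriptions.
overrides = {"กรุงเทพ": "override-reading"}
indic_learned = {"ปกติ": "learned-reading"}
exact = {"กรุงเทพ": "exact-reading", "ปลา": "exact-reading"}

# Caller overrides come first, then the built-in lexicons in precedence order.
chain = [
    ("override", overrides.get),
    ("indic_learned", indic_learned.get),
    ("exact", exact.get),
]

override_hit = first_hit("กรุงเทพ", chain)
builtin_hit = first_hit("ปลา", chain)
miss = first_hit("แมว", chain)  # None: the word would fall through to syllabification
```

Because the override lookup sits at the head of the chain, a caller-supplied entry shadows a built-in entry for the same word.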

Stage 3: Syllabification¶

Modules: thaiphon.tokenization.tcc, thaiphon.syllabification

For words not in any lexicon:

TCC tokenizer (tcc.tokenize) breaks the input into Thai Character Cluster units — the smallest units that cannot be split across a syllable boundary. TCC is a well-established Thai NLP primitive.
Candidate generator (syllabification.generator.CandidateGenerator) takes the TCC chunks and produces candidate segmentations. It applies heuristics for common cluster patterns (C+r, C+l, C+w), for the leading-ห pattern, and for the aksornam pattern.
Candidate ranker (syllabification.ranker.CandidateRanker) scores candidates and returns them in rank order. The pipeline uses the top-ranked candidate.

Each candidate is a SyllabificationCandidate with a list of orthographic segments (one per syllable) and a score.
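To make the generate-then-rank shape concrete, here is a toy sketch. It enumerates every grouping of consecutive chunks by brute force and ranks with a caller-supplied scorer; the real CandidateGenerator applies targeted heuristics instead, and `CandidateSketch`, `all_groupings`, and `rank` are hypothetical names.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class CandidateSketch:
    segments: tuple  # one orthographic segment per syllable
    score: float


def all_groupings(chunks):
    """Enumerate every way to group consecutive chunks into syllables."""
    n = len(chunks)
    groupings = []
    for k in range(n):  # k = number of internal syllable boundaries
        for cuts in combinations(range(1, n), k):
            bounds = (0, *cuts, n)
            groupings.append(
                tuple("".join(chunks[a:b]) for a, b in zip(bounds, bounds[1:]))
            )
    return groupings


def rank(chunks, scorer):
    """Score each grouping and return candidates best-first."""
    candidates = [CandidateSketch(g, scorer(g)) for g in all_groupings(chunks)]
    return sorted(candidates, key=lambda c: c.score, reverse=True)


# Two pre-tokenised chunks; the scorer (segment count) is a stand-in, not a real heuristic.
chunks = ["มา", "นี"]
ranked = rank(chunks, scorer=len)
best = ranked[0]
```

Only `best` would be handed to derivation, which mirrors the statement that the pipeline uses the top-ranked candidate.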

Stage 4: Rule-based derivation¶

Modules: thaiphon.derivation.*

Each candidate segment is passed to _derive_syllable, which runs six sub-derivations in sequence:

Onset resolution (derivation.onset) — identify the initial consonant(s), their class, and IPA phoneme.
Final extraction — scan the post-onset characters for a trailing consonant (with thanthakhat handling).
Vowel resolution (derivation.vowel) — identify the vowel quality, length, and any offglide from the remaining characters.
Coda resolution (derivation.coda) — map the final consonant to its collapsed IPA coda phoneme.
Syllable type classification (derivation.syllable_type) — live vs. dead.
Tone assignment (derivation.tone) — look up in the tone matrix.
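The last two steps can be sketched with the standard textbook rules for Thai tone: consonant class, syllable type, vowel length, and tone mark together select the tone. This is an illustration of a tone-matrix lookup, not a copy of thaiphon.tables.tone_matrix; the function names and string labels are assumptions.

```python
# Codas that keep a syllable "live" (sonorants); stop codas make it "dead".
SONORANT_CODAS = {"m", "n", "ŋ", "j", "w"}


def is_live(long_vowel, coda):
    """Live: long open syllable or sonorant coda. Dead: short open syllable or stop coda."""
    if coda is None:
        return long_vowel
    return coda in SONORANT_CODAS


# (consonant class, tone mark) -> tone, for syllables that carry a tone mark.
MARKED_TONES = {
    ("mid", "ek"): "low",
    ("mid", "tho"): "falling",
    ("mid", "tri"): "high",
    ("mid", "chattawa"): "rising",
    ("high", "ek"): "low",
    ("high", "tho"): "falling",
    ("low", "ek"): "falling",
    ("low", "tho"): "high",
}

# Unmarked live syllables depend on consonant class only.
UNMARKED_LIVE = {"mid": "mid", "high": "rising", "low": "mid"}


def assign_tone(cls, live, long_vowel, mark=None):
    """Look up the tone; combinations outside the standard rules raise KeyError."""
    if mark is not None:
        return MARKED_TONES[(cls, mark)]
    if live:
        return UNMARKED_LIVE[cls]
    if cls == "low":
        return "falling" if long_vowel else "high"  # only low class splits on length
    return "low"  # mid and high class dead syllables
```

For example, กา (mid class, live) comes out as mid tone, ขา (high class, live) as rising, and มาก (low class, dead, long vowel) as falling.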

After all segments are derived, two post-processing steps apply:

Aksornam propagation — bare leader consonants promote the following syllable's class.
Length overrides — the length lexicon corrects vowel length for words where the standard derivation would get it wrong.

Stage 5: PhonologicalWord assembly¶

Module: thaiphon.model

The derived syllables are assembled into a PhonologicalWord — a frozen dataclass containing a tuple of Syllable objects, a confidence score, a source tag, and the raw input string.

This is the universal intermediate representation. It is scheme-independent: every output scheme receives exactly the same PhonologicalWord.
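A frozen dataclass of this shape might look like the sketch below. The four top-level fields follow the description above, but every field name, the `SyllableSketch` layout, and the sample values are assumptions rather than the definitions in thaiphon.model.

```python
from dataclasses import FrozenInstanceError, dataclass


@dataclass(frozen=True)
class SyllableSketch:
    onset: str
    vowel: str
    coda: str
    tone: str


@dataclass(frozen=True)
class PhonologicalWordSketch:
    syllables: tuple   # tuple of SyllableSketch, so the sequence is immutable too
    confidence: float
    source: str        # e.g. "lexicon"
    raw: str           # the original input string


word = PhonologicalWordSketch(
    syllables=(SyllableSketch(onset="m", vowel="aː", coda="", tone="mid"),),
    confidence=1.0,
    source="lexicon",
    raw="มา",
)

# Frozen instances reject attribute assignment.
try:
    word.source = "derived"
    mutated = True
except FrozenInstanceError:
    mutated = False
```

Because every field is itself immutable, the same instance can be passed to several renderers with no risk that one of them alters what the next one sees.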

Stage 6: Rendering¶

Modules: thaiphon.renderers.*

A Renderer transforms a PhonologicalWord into a string. Each built-in scheme is implemented as a MappingRenderer driven by a SchemeMapping — a frozen dataclass of maps and a tone_format function.

Rendering walks the syllable tuple, looks up each onset/vowel/coda in the appropriate map, calls tone_format to add tone decoration, and joins syllables with the syllable_separator.

The RenderContext carries the profile name and output format ("text" or "html") so schemes can gate profile-sensitive decisions (e.g. foreign coda preservation) without additional API arguments.
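The mapping-driven design can be sketched as follows. The toy scheme, the field names, the `tone_format` signature, and the `render` function are all hypothetical; they show the walk-lookup-join shape described above, not the real MappingRenderer or any of the built-in schemes.

```python
from dataclasses import dataclass
from typing import Callable, Mapping


@dataclass(frozen=True)
class RenderContextSketch:
    profile: str
    output_format: str  # "text" or "html"


@dataclass(frozen=True)
class SchemeMappingSketch:
    onsets: Mapping[str, str]
    vowels: Mapping[str, str]
    codas: Mapping[str, str]
    tone_format: Callable[[str, str, RenderContextSketch], str]
    syllable_separator: str = "."


def toy_tone_format(body, tone, ctx):
    """Decorate a rendered syllable with its tone; the format depends on the context."""
    if ctx.output_format == "html":
        return f"{body}<sup>{tone}</sup>"
    return f"{body}[{tone}]"


def render(syllables, mapping, ctx):
    """Walk (onset, vowel, coda, tone) tuples, map each part, then join."""
    parts = []
    for onset, vowel, coda, tone in syllables:
        body = mapping.onsets[onset] + mapping.vowels[vowel] + mapping.codas[coda]
        parts.append(mapping.tone_format(body, tone, ctx))
    return mapping.syllable_separator.join(parts)


toy_scheme = SchemeMappingSketch(
    onsets={"m": "m", "n": "n"},
    vowels={"aa": "aa", "ii": "ii"},
    codas={"": ""},
    tone_format=toy_tone_format,
)
syllables = [("m", "aa", "", "mid"), ("n", "ii", "", "mid")]

as_text = render(syllables, toy_scheme, RenderContextSketch("default", "text"))
as_html = render(syllables, toy_scheme, RenderContextSketch("default", "html"))
```

Keeping the scheme as data plus one formatting function means a new scheme needs new tables rather than new traversal code.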

Subpackage map¶

| Subpackage | What lives there |
| --- | --- |
| `thaiphon.api` | Public entry points: `transcribe`, `analyze`, `list_schemes`, etc. |
| `thaiphon.model` | Frozen dataclasses: `Syllable`, `PhonologicalWord`, `AnalysisResult`, `Phoneme`, `Cluster`; all enumerations |
| `thaiphon.normalization` | Unicode normalisation (`unicode_norm`) and text expansion (`expand`) |
| `thaiphon.pipeline` | `PipelineRunner`, which orchestrates all stages; `RenderContext` carrier |
| `thaiphon.syllabification` | `CandidateGenerator`, `CandidateRanker`, strategies |
| `thaiphon.tokenization` | `tcc.tokenize`: Thai Character Cluster tokenizer |
| `thaiphon.derivation` | `onset`, `vowel`, `coda`, `syllable_type`, `tone`: one module per derivation step |
| `thaiphon.tables` | Static lookup tables: `consonants`, `final_collapse`, `tone_matrix`, `clusters`, `leaders` |
| `thaiphon.lexicons` | Lexicon modules: `exact`, `indic_learned`, `irregular`, `loanword`, `length_overrides`, `royal`, `silent_h`, `ror_ror`, `thor`, `rue`, etc. |
| `thaiphon.renderers` | `base` (protocol + `RenderContext`), `mapping` (`SchemeMapping`, `MappingRenderer`), `ipa`, `tlc`, `morev` |
| `thaiphon.segmentation` | `longest.segment`: dictionary-based longest-match Thai word segmenter |
| `thaiphon.overrides` | `register_lexicon`, `unregister_lexicon`, `registered_lexicons`: user-supplied override hooks |
| `thaiphon.registry` | `Registry` generic, `RENDERERS` singleton |
| `thaiphon.errors` | Exception hierarchy |

Where to look for what¶

Adding a new scheme: thaiphon/renderers/ and thaiphon/registry.py. See Write your own scheme.
Fixing a derivation rule: thaiphon/derivation/ and thaiphon/tables/.
Adding a lexicon entry: thaiphon/lexicons/ — each file is a plain Python dict or frozenset.
Changing syllabification heuristics: thaiphon/syllabification/generator.py and thaiphon/syllabification/strategies.py.
Changing normalisation: thaiphon/normalization/.
Changing the public API signature: thaiphon/api.py.