Pipeline overview

Architecture

thaiphon converts Thai text to phonetic notation through a linear pipeline of pure functions. Each stage is independently testable and performs no network calls or other I/O.


The pipeline

The thaiphon pipeline. Thai text enters at the top and passes through six numbered stages: normalisation and expansion, lexicon lookups (with a hit shortcut that bypasses later stages), syllabification, rule-based derivation, the phonological-word data structure, and the renderer. The output is three parallel surface strings in IPA, TLC, and Morev notation.
Six stages. One immutable phonological model. A single call to transcribe() runs the whole pipeline.

Stage 1: Normalisation and expansion

Module: thaiphon.normalization

  • unicode_norm.normalize applies NFC normalisation, strips Unicode variation selectors, and reorders Thai combining marks into canonical order (vowel marks before tone marks before killer marks). This guards against mark-order variation that different input sources produce.
  • expand.expand rewrites abbreviation and shorthand forms: Sara Am (◌ำ) decomposes into its constituent /aː/ vowel + /m/ coda, ๆ (mai yamok) repeats the preceding word, ฯลฯ expands to "and so on", and isolated Thai digits are spelled out as words.

The public API functions call normalize automatically; all profiles and schemes receive NFC input.
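The normalisation steps described above can be sketched with the standard library alone. This is an illustrative sketch, not thaiphon's implementation: the function name normalize_sketch, the mark ranking table, and the choice of which code points count as vowel marks are all assumptions made for the example.

```python
import unicodedata

# Assumed ranking for this sketch only: vowel marks (0), tone marks (1),
# killer mark thanthakhat (2). The real library may classify marks differently.
_MARK_RANK = {
    **{ch: 0 for ch in "\u0e31\u0e34\u0e35\u0e36\u0e37\u0e38\u0e39"},
    **{ch: 1 for ch in "\u0e48\u0e49\u0e4a\u0e4b"},
    "\u0e4c": 2,
}


def _is_variation_selector(ch: str) -> bool:
    """True for U+FE00..U+FE0F and U+E0100..U+E01EF."""
    cp = ord(ch)
    return 0xFE00 <= cp <= 0xFE0F or 0xE0100 <= cp <= 0xE01EF


def normalize_sketch(text: str) -> str:
    """NFC-normalise, drop variation selectors, reorder Thai mark runs."""
    text = unicodedata.normalize("NFC", text)
    text = "".join(ch for ch in text if not _is_variation_selector(ch))

    out = []
    run = []  # consecutive ranked marks awaiting reordering
    for ch in text:
        if ch in _MARK_RANK:
            run.append(ch)
            continue
        # sorted() is stable, so marks of equal rank keep their input order.
        out.extend(sorted(run, key=_MARK_RANK.__getitem__))
        run.clear()
        out.append(ch)
    out.extend(sorted(run, key=_MARK_RANK.__getitem__))
    return "".join(out)


# A tone mark typed before a vowel mark is moved behind it.
reordered = normalize_sketch("\u0e01\u0e49\u0e34")
```

Because each step is a pure function of its input string, the result can be checked with plain equality assertions and no fixtures.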


Stage 2: Lexicon lookups

Modules: thaiphon.lexicons.*

Before running the derivation pipeline, the runner consults several lexicons:

  1. Indic learned lexicon — Sanskrit/Pali-derived words with register-dependent pronunciations. Takes highest precedence for words where colloquial vs. learned readings differ materially.
  2. Exact lexicon (VOLUBILIS / royal) — full-form word entries with pre-computed phonological words. When a word is found here, derivation is skipped entirely.
  3. Calendar lexicon — months, weekdays, abbreviations.
  4. Irregular readings and respelling lexicons — words whose syllabification or vowel pattern diverges from the standard rules. These are respelled into a form the derivation pipeline can handle.
  5. ทร and ฤ lexicons — the ambiguous digraphs described in Special cases.

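If a word is found in a lexicon that stores a finished reading, the runner returns that reading and skips syllabification and derivation. Respelling entries (item 4) are the exception: they are rewritten and then passed on to derivation. Lexicon entries carry a source="lexicon" tag in the resulting AnalysisResult.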
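The precedence order above amounts to a first-hit lookup over an ordered list of lexicons. The sketch below illustrates that idea only; lookup_first_hit, the (name, mapping) pair layout, and the placeholder readings are hypothetical and do not reflect how thaiphon.lexicons is actually organised.

```python
from typing import Mapping, Optional, Sequence, Tuple

# Hypothetical shape: each lexicon is a (name, entries) pair, listed in
# precedence order. Readings are placeholder strings, not real data.
Lexicon = Tuple[str, Mapping[str, str]]


def lookup_first_hit(
    word: str, lexicons: Sequence[Lexicon]
) -> Optional[Tuple[str, str]]:
    """Return (lexicon name, reading) from the first lexicon containing word."""
    for name, entries in lexicons:
        reading = entries.get(word)
        if reading is not None:
            return name, reading
    return None  # miss: the caller would fall through to syllabification


lexicons = [
    ("indic_learned", {"word-a": "learned reading"}),
    ("exact", {"word-a": "colloquial reading", "word-b": "exact reading"}),
]

hit = lookup_first_hit("word-a", lexicons)   # earlier lexicon wins
miss = lookup_first_hit("word-z", lexicons)  # no lexicon has it
```

Keeping the lexicons in an explicit sequence makes the precedence visible in one place instead of being spread across nested conditionals.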


Stage 3: Syllabification

Modules: thaiphon.tokenization.tcc, thaiphon.syllabification

For words not in any lexicon:

  1. TCC tokenizer (tcc.tokenize) breaks the input into Thai Character Cluster units — the smallest units that cannot be split across a syllable boundary. TCC is a well-established Thai NLP primitive.
  2. Candidate generator (syllabification.generator.CandidateGenerator) takes the TCC chunks and produces candidate segmentations. It applies heuristics for common cluster patterns (C+r, C+l, C+w), for the leading-ห pattern, and for the aksornam pattern.
  3. Candidate ranker (syllabification.ranker.CandidateRanker) scores candidates and returns them in rank order. The pipeline uses the top-ranked candidate.

Each candidate is a SyllabificationCandidate with a list of orthographic segments (one per syllable) and a score.
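As a rough illustration of the candidate structure and ranking step, the sketch below models a candidate as a frozen dataclass and ranks by score. The field names, the class name CandidateSketch, and the assumption that a higher score is better are all made up for the example; the actual SyllabificationCandidate and CandidateRanker may differ.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class CandidateSketch:
    """Hypothetical stand-in for a syllabification candidate."""
    segments: Tuple[str, ...]  # one orthographic segment per syllable
    score: float               # assumed: higher means more plausible


def rank(candidates: Sequence[CandidateSketch]) -> List[CandidateSketch]:
    """Return candidates best-first; ties keep their generation order."""
    return sorted(candidates, key=lambda c: c.score, reverse=True)


# Placeholder segmentations of the same six-character input.
candidates = [
    CandidateSketch(segments=("ab", "cdef"), score=0.4),
    CandidateSketch(segments=("abc", "def"), score=0.9),
    CandidateSketch(segments=("abcdef",), score=0.1),
]

best = rank(candidates)[0]  # the pipeline would derive from this one
```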


Stage 4: Rule-based derivation

Modules: thaiphon.derivation.*

Each candidate segment is passed to _derive_syllable, which runs six sub-derivations in sequence:

  1. Onset resolution (derivation.onset) — identify the initial consonant(s), their class, and IPA phoneme.
  2. Final extraction — scan the post-onset characters for a trailing consonant (with thanthakhat handling).
  3. Vowel resolution (derivation.vowel) — identify the vowel quality, length, and any offglide from the remaining characters.
  4. Coda resolution (derivation.coda) — map the final consonant to its collapsed IPA coda phoneme.
  5. Syllable type classification (derivation.syllable_type) — live vs. dead.
  6. Tone assignment (derivation.tone) — look up in the tone matrix.
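Steps 5 and 6 can be illustrated with the standard textbook tone rules for Thai. The sketch below is not a copy of thaiphon.tables.tone_matrix: the function names, string labels, and table layout are assumptions, and only the regular class/mark combinations are covered.

```python
from typing import Dict, Optional, Tuple

SONORANT_CODAS = {"m", "n", "ŋ", "j", "w"}


def is_live(coda: Optional[str], long_vowel: bool) -> bool:
    """Live: sonorant coda, or an open syllable with a long vowel."""
    if coda is None:
        return long_vowel
    return coda in SONORANT_CODAS


# (consonant class, tone mark) -> tone, for syllables that carry a mark.
MARKED: Dict[Tuple[str, str], str] = {
    ("mid", "mai_ek"): "low",
    ("mid", "mai_tho"): "falling",
    ("mid", "mai_tri"): "high",
    ("mid", "mai_chattawa"): "rising",
    ("high", "mai_ek"): "low",
    ("high", "mai_tho"): "falling",
    ("low", "mai_ek"): "falling",
    ("low", "mai_tho"): "high",
}

# (consonant class, live?) -> tone, for unmarked syllables.
# Low-class dead syllables also depend on vowel length; handled below.
UNMARKED: Dict[Tuple[str, bool], str] = {
    ("mid", True): "mid",
    ("mid", False): "low",
    ("high", True): "rising",
    ("high", False): "low",
    ("low", True): "mid",
}


def assign_tone(
    cls: str, live: bool, long_vowel: bool, mark: Optional[str] = None
) -> str:
    """Look up the tone from consonant class, syllable type, and tone mark."""
    if mark is not None:
        return MARKED[(cls, mark)]
    if cls == "low" and not live:
        return "falling" if long_vowel else "high"
    return UNMARKED[(cls, live)]


# Low-class onset, short vowel, stop coda, no tone mark.
tone = assign_tone("low", live=is_live("k", False), long_vowel=False)
```

Expressing the rules as lookup tables keeps the tone step free of branching logic apart from the one vowel-length distinction.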

After all segments are derived, two post-processing steps apply:

  • Aksornam propagation — bare leader consonants promote the following syllable's class.
  • Length overrides — the length lexicon corrects vowel length for words where the standard derivation would get it wrong.

Stage 5: PhonologicalWord assembly

Module: thaiphon.model

The derived syllables are assembled into a PhonologicalWord — a frozen dataclass containing a tuple of Syllable objects, a confidence score, a source tag, and the raw input string.

This is the universal intermediate. It is scheme-independent: every output scheme receives exactly the same PhonologicalWord.
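A frozen dataclass holding a tuple gives the immutability described here. The sketch below shows the general shape under assumed names: SyllableSketch, PhonologicalWordSketch, and every field name are hypothetical, and the real classes in thaiphon.model carry more detail.

```python
from dataclasses import FrozenInstanceError, dataclass
from typing import Tuple


@dataclass(frozen=True)
class SyllableSketch:
    """Hypothetical, simplified syllable record."""
    onset: str
    vowel: str
    coda: str
    tone: str


@dataclass(frozen=True)
class PhonologicalWordSketch:
    """Hypothetical scheme-independent intermediate."""
    syllables: Tuple[SyllableSketch, ...]  # tuple, so the sequence is immutable too
    confidence: float
    source: str  # e.g. "lexicon" or a derivation tag (values assumed)
    raw: str     # the original input string


word = PhonologicalWordSketch(
    syllables=(SyllableSketch(onset="k", vowel="aː", coda="n", tone="mid"),),
    confidence=1.0,
    source="derivation",
    raw="กาน",
)

# Any attempt to mutate the word after assembly is rejected.
try:
    word.confidence = 0.5  # type: ignore[misc]
    mutated = True
except FrozenInstanceError:
    mutated = False
```

Because the object cannot change after assembly, every renderer can be handed the same instance without defensive copying.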


Stage 6: Rendering

Modules: thaiphon.renderers.*

A Renderer transforms a PhonologicalWord into a string. Each built-in scheme is implemented as a MappingRenderer driven by a SchemeMapping — a frozen dataclass of maps and a tone_format function.

Rendering walks the syllable tuple, looks up each onset/vowel/coda in the appropriate map, calls tone_format to add tone decoration, and joins syllables with the syllable_separator.

The RenderContext carries the profile name and output format ("text" or "html") so schemes can gate profile-sensitive decisions (e.g. foreign coda preservation) without additional API arguments.
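The mapping-driven rendering loop can be sketched in a few lines. Everything named here is an assumption for illustration: SchemeMappingSketch, its field names, the abstract phoneme keys, and the toy output notation are invented, and the sketch omits RenderContext entirely.

```python
from dataclasses import dataclass
from typing import Callable, Mapping, Sequence, Tuple

# Hypothetical syllable shape: (onset key, vowel key, coda key, tone name).
Syl = Tuple[str, str, str, str]


@dataclass(frozen=True)
class SchemeMappingSketch:
    """Hypothetical scheme description: lookup maps plus a tone formatter."""
    onsets: Mapping[str, str]
    vowels: Mapping[str, str]
    codas: Mapping[str, str]
    tone_format: Callable[[str, str], str]  # (syllable body, tone) -> decorated
    syllable_separator: str = "."


def render(syllables: Sequence[Syl], scheme: SchemeMappingSketch) -> str:
    """Map each syllable part through the scheme and join the results."""
    parts = []
    for onset, vowel, coda, tone in syllables:
        body = scheme.onsets[onset] + scheme.vowels[vowel] + scheme.codas[coda]
        parts.append(scheme.tone_format(body, tone))
    return scheme.syllable_separator.join(parts)


# A toy scheme with made-up notation, not IPA, TLC, or Morev.
toy = SchemeMappingSketch(
    onsets={"K": "k"},
    vowels={"AA": "aa"},
    codas={"N": "n", "": ""},
    tone_format=lambda body, tone: f"{body}[{tone}]",
)

text = render([("K", "AA", "N", "mid"), ("K", "AA", "", "low")], toy)
```

Under this design a new scheme is data rather than code: it supplies different maps and a different tone_format, and the rendering loop stays unchanged.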


Subpackage map

Subpackage                What lives there
thaiphon.api              Public entry points: transcribe, analyze, list_schemes, etc.
thaiphon.model            Frozen dataclasses: Syllable, PhonologicalWord, AnalysisResult, Phoneme, Cluster; all enumerations
thaiphon.normalization    Unicode normalisation (unicode_norm) and text expansion (expand)
thaiphon.pipeline         PipelineRunner (orchestrates all stages); RenderContext carrier
thaiphon.syllabification  CandidateGenerator, CandidateRanker, strategies
thaiphon.tokenization     tcc.tokenize: Thai Character Cluster tokenizer
thaiphon.derivation       onset, vowel, coda, syllable_type, tone (one module per derivation step)
thaiphon.tables           Static lookup tables: consonants, final_collapse, tone_matrix, clusters, leaders
thaiphon.lexicons         Lexicon modules: exact, indic_learned, irregular, loanword, length_overrides, royal, silent_h, ror_ror, thor, rue, etc.
thaiphon.renderers        base (protocol + RenderContext), mapping (SchemeMapping, MappingRenderer), ipa, tlc, morev
thaiphon.segmentation     longest.segment: dictionary-based longest-match Thai word segmenter
thaiphon.registry         Registry generic, RENDERERS singleton
thaiphon.errors           Exception hierarchy

Where to look for what

  • Adding a new scheme: thaiphon/renderers/ and thaiphon/registry.py. See Write your own scheme.
  • Fixing a derivation rule: thaiphon/derivation/ and thaiphon/tables/.
  • Adding a lexicon entry: thaiphon/lexicons/ — each file is a plain Python dict or frozenset.
  • Changing syllabification heuristics: thaiphon/syllabification/generator.py and thaiphon/syllabification/strategies.py.
  • Changing normalisation: thaiphon/normalization/.
  • Changing the public API signature: thaiphon/api.py.