Design constraints
These are the constraints that shaped thaiphon's architecture. Understanding them makes it easier to contribute without breaking assumptions that the rest of the system depends on.
Zero runtime dependencies
thaiphon ships as a pure-Python package with no runtime dependencies. Everything — the phonological rules, the lexicons, the syllabification logic, the normalisation — is implemented in standard-library Python.
Why: Zero-dependency packages can be installed in any environment without conflict. They work offline. They pass security audits more easily. They can be bundled into applications without transitive-dependency concerns.
Consequences:
- No NumPy, no machine-learning frameworks, no database drivers, no network calls.
- The lexicons are Python dicts and frozensets, not SQLite files or JSON blobs.
- Word segmentation uses a trie built from module-level Python literals.
- pythainlp is an optional dependency for sentence segmentation only, and thaiphon functions without it.
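As a rough illustration of the trie point above, the sketch below builds a trie from a module-level literal and segments text with it using only the standard library. The word list, the names _WORDS, _TRIE, and segment, and the greedy longest-match strategy are all assumptions made for the example; thaiphon's actual segmenter may work differently.

```python
from __future__ import annotations

# Hypothetical word list; the real lexicon contents are not shown here.
_WORDS: frozenset[str] = frozenset({"ไป", "โรง", "โรงเรียน", "เรียน"})

_END = ""  # sentinel key meaning "a word ends at this node"


def _build_trie(words: frozenset[str]) -> dict:
    """Build a nested-dict trie, one level per character."""
    root: dict = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[_END] = True
    return root


# Built once at import time from the literal above.
_TRIE = _build_trie(_WORDS)


def segment(text: str) -> list[str]:
    """Greedy longest-match segmentation (an assumed strategy).

    Characters that start no known word are emitted as single tokens.
    """
    tokens: list[str] = []
    i = 0
    while i < len(text):
        node, longest = _TRIE, 0
        for j in range(i, len(text)):
            node = node.get(text[j])
            if node is None:
                break
            if _END in node:
                longest = j + 1 - i
        step = longest or 1
        tokens.append(text[i:i + step])
        i += step
    return tokens
```

Because the trie is ordinary nested dicts built from a literal, it needs no data files and no third-party packages.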
Immutable phonological model
Every object in thaiphon.model is a frozen dataclass. Phoneme, Cluster, Syllable, PhonologicalWord, AnalysisResult, SyllabificationCandidate — none can be mutated after creation.
Why: Immutability prevents a common class of bugs where a cached intermediate is accidentally modified by a downstream step. It makes the model safe for concurrent use. It allows the objects to be used as dictionary keys or in sets.
Consequences:
- Modifying a syllable requires creating a new one (dataclasses.replace is the idiomatic tool). The pipeline runner does this in several places (aksornam propagation, length overrides).
- Renderers read from PhonologicalWord without copying; there is no risk of the renderer dirtying the intermediate.
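The pattern can be sketched with a simplified stand-in class. The Syllable fields below are hypothetical and chosen only for illustration; the real thaiphon.model classes may carry different fields.

```python
from dataclasses import FrozenInstanceError, dataclass, replace


# Hypothetical, simplified stand-in for a model class.
@dataclass(frozen=True)
class Syllable:
    onset: str
    vowel: str
    long: bool
    tone: str


original = Syllable(onset="k", vowel="a", long=False, tone="mid")

# A "modification" builds a new object and leaves the original untouched.
lengthened = replace(original, long=True)

# Direct attribute assignment is rejected on a frozen dataclass.
try:
    original.long = True  # type: ignore[misc]
except FrozenInstanceError:
    mutation_blocked = True
else:
    mutation_blocked = False

# Frozen dataclasses whose fields are hashable can be used as dict keys.
cache = {original: "short", lengthened: "long"}
```

A step that needs a length override would return the replaced copy, so any cached reference to the original stays valid.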
Pure functions
Every derivation step (derivation.onset.resolve_onset, derivation.vowel.resolve_vowel, etc.) is a pure function: given the same inputs, it always returns the same output, and it has no side effects.
Why: Pure functions are trivially testable. Each derivation step can be called directly in a test without setting up any global state.
Consequences:
- There is no global mutable state in the derivation layer.
- The pipeline runner is stateless after construction. A single PipelineRunner instance can be shared across threads.
- The lookup tables (tables.tone_matrix, tables.final_collapse, etc.) are module-level MappingProxyType objects — read-only dictionaries constructed once at import time.
NFC normalisation at the API boundary
All input text is NFC-normalised before any phonological processing. This happens in the pipeline runner's _analyze_core method before any other step.
Why: Thai is routinely represented in NFD, NFC, and non-canonical mark orders — depending on operating system, input method, web browser, and font toolchain. Without a normalisation step, the same phonological word would produce different internal representations.
Consequences:
- Callers do not need to pre-normalise input. NFC and NFD input produce identical output.
- The mark reordering step (unicode_norm._reorder_marks) handles the common case where vowel marks and tone marks are written in the wrong order.
- The variation-selector strip removes invisible characters that some font systems attach to Thai base characters.
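The three steps above can be sketched with the standard library alone. This is an assumed, simplified version: the function name normalise, the order of the steps, and the exact character ranges are illustrative and may not match unicode_norm.

```python
import re
import unicodedata

# Variation selectors VS1..VS16.
_VARIATION_SELECTORS = re.compile("[\uFE00-\uFE0F]")

# A tone mark (U+0E48..U+0E4B) typed before an above/below vowel
# (U+0E31, U+0E34..U+0E39). NFC alone does not swap a tone mark with
# the above vowels, so a dedicated reordering step is still needed.
_TONE_BEFORE_VOWEL = re.compile("([\u0E48-\u0E4B])([\u0E31\u0E34-\u0E39])")


def normalise(text: str) -> str:
    """NFC-normalise, drop variation selectors, then put vowel before tone."""
    text = unicodedata.normalize("NFC", text)
    text = _VARIATION_SELECTORS.sub("", text)
    return _TONE_BEFORE_VOWEL.sub(r"\2\1", text)
```

With this in place, differently ordered spellings of the same syllable collapse to one code-point sequence before any phonological step sees them.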
Lexicons as Python literals
All lexicons built into the engine — the loanword list, the length overrides, the irregular readings, the silent-ห set — are defined as Python module-level dicts and frozensets.
Why: Python literals are version-controlled, greppable, and auditable. Adding an entry is a one-line diff. No migration scripts, no schema management, no database connectivity.
Consequences:
- Lexicons load at import time, not at first call. This adds a few milliseconds to the first import of thaiphon but nothing to subsequent calls.
- Lexicon entries are regular Python values and can be inspected with dir() and pprint with no special tooling.
The data package uses SQLite, not Python literals
The optional thaiphon-data-volubilis package contains 84 k Thai words with pre-derived phonological readings — far too large to ship as a Python module without inflating import time and per-process RSS substantially. It stores its lexicon in a single read-only SQLite file (lexicon.db) instead.
Why SQLite: The file is opened with immutable=1 and a generous mmap_size, so the operating-system page cache can share the database's physical pages across every process that has imported the package. A gunicorn pool of eight workers, for instance, reads from one shared ~25 MiB pool of mapped pages rather than holding eight independent copies of the data in heap memory. A plain Python dict of 84 k entries with object children does not benefit from this: each worker allocates and holds its own copy, so total lexicon RSS scales with worker count. The file-based design avoids that.
Consequences:
- The engine itself (thaiphon) imports nothing related to SQLite; its dependency surface is unchanged.
- sqlite3 is part of the CPython standard library, so the data package picks up no third-party runtime dependency.
- The data package exposes ENTRIES: Mapping[str, PhonologicalWord] — a standard Mapping interface. Code that calls ENTRIES["สวัสดี"] does not need to know that SQLite is behind it.
- Connections open lazily, one per thread, via threading.local(). FastAPI, Django, and other thread-based serving setups work without extra configuration.
- See Lexicon storage for the full design, memory numbers, and deployment notes.
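The sketch below shows one way such a store could look: a read-only Mapping over a SQLite file opened with immutable=1, with one lazily opened connection per thread. The class name SqliteLexicon, the entries(word, reading) schema, the plain-string values, and the 64 MiB mmap_size are assumptions for illustration; the real package returns PhonologicalWord objects and its schema is not described here.

```python
import sqlite3
import tempfile
import threading
from collections.abc import Iterator, Mapping
from pathlib import Path


class SqliteLexicon(Mapping):
    """Read-only Mapping over a SQLite file, one lazy connection per thread.

    Hypothetical schema: entries(word TEXT PRIMARY KEY, reading TEXT).
    """

    def __init__(self, path: Path, mmap_size: int = 64 * 1024 * 1024) -> None:
        # immutable=1 tells SQLite the file will never change.
        self._uri = path.resolve().as_uri() + "?immutable=1"
        self._mmap_size = mmap_size
        self._local = threading.local()

    def _conn(self) -> sqlite3.Connection:
        conn = getattr(self._local, "conn", None)
        if conn is None:  # first access on this thread
            conn = sqlite3.connect(self._uri, uri=True)
            conn.execute(f"PRAGMA mmap_size = {self._mmap_size}")
            self._local.conn = conn
        return conn

    def close(self) -> None:
        """Close the calling thread's connection, if it has one."""
        conn = getattr(self._local, "conn", None)
        if conn is not None:
            conn.close()
            self._local.conn = None

    def __getitem__(self, word: str) -> str:
        row = self._conn().execute(
            "SELECT reading FROM entries WHERE word = ?", (word,)
        ).fetchone()
        if row is None:
            raise KeyError(word)
        return row[0]

    def __iter__(self) -> Iterator[str]:
        for (word,) in self._conn().execute("SELECT word FROM entries"):
            yield word

    def __len__(self) -> int:
        return self._conn().execute("SELECT COUNT(*) FROM entries").fetchone()[0]


def build_demo_db(path: Path) -> None:
    """Write a tiny database so the sketch can run end to end."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE entries (word TEXT PRIMARY KEY, reading TEXT)")
    conn.execute("INSERT INTO entries VALUES (?, ?)", ("demo", "reading"))
    conn.commit()
    conn.close()


# Demo: callers use plain Mapping operations and never touch SQLite directly.
with tempfile.TemporaryDirectory() as tmp:
    db_path = Path(tmp) / "lexicon.db"
    build_demo_db(db_path)
    ENTRIES = SqliteLexicon(db_path)
    found = ENTRIES["demo"]
    has_absent = "absent" in ENTRIES
    size = len(ENTRIES)
    ENTRIES.close()
```

Subclassing collections.abc.Mapping supplies get, keys, items, and the in operator from the three methods defined here, which is what lets calling code stay unaware of the storage layer.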
Scheme-neutral intermediate
The PhonologicalWord is deliberately scheme-neutral. It uses IPA symbols for all phonemes and does not make any notation choices.
Why: Decoupling the phonological analysis from the rendering means that improvements to the analysis (fixing a derivation rule, adding a lexicon entry) propagate to all schemes simultaneously. An analysis fix that corrects the IPA output automatically corrects the TLC and Morev output as well.
Consequences:
- Schemes must declare a complete mapping for every IPA phoneme they might encounter. The unknown_fallback field handles gaps.
- The internal IPA symbols are not necessarily identical to standard Wiktionary IPA — they are a controlled internal vocabulary. The IPA renderer maps them one-to-one to standard IPA; other renderers map them to their own alphabet.
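To make the mapping-plus-fallback idea concrete, here is a small sketch. The Scheme class, its symbols field, the render method, and both example mappings are hypothetical; only the unknown_fallback name comes from the text above, and the romanised scheme is invented for the example rather than being TLC or Morev.

```python
from dataclasses import dataclass
from types import MappingProxyType
from typing import Mapping


@dataclass(frozen=True)
class Scheme:
    """Hypothetical renderer: maps internal phoneme symbols to output text."""

    name: str
    symbols: Mapping[str, str]
    unknown_fallback: str = "?"

    def render(self, phonemes: tuple[str, ...]) -> str:
        # Any phoneme missing from the mapping falls back to unknown_fallback.
        return "".join(
            self.symbols.get(p, self.unknown_fallback) for p in phonemes
        )


# One scheme-neutral phoneme sequence (illustrative symbols).
PHONEMES = ("kʰ", "aː", "w")

IPA = Scheme("ipa", MappingProxyType({"kʰ": "kʰ", "aː": "aː", "w": "w"}))

# Deliberately incomplete mapping, to show the fallback in action.
DEMO_ROMAN = Scheme("demo-roman", MappingProxyType({"kʰ": "kh", "aː": "aa"}))

ipa_text = IPA.render(PHONEMES)
roman_text = DEMO_ROMAN.render(PHONEMES)
```

Both renderers consume the same phoneme tuple, so a correction to that tuple changes every scheme's output at once, while a gap in one scheme's mapping shows up only in that scheme.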
Python 3.10+
match/case is not used (for broader compatibility), but thaiphon does use:
- dataclasses with slots=True (3.10+)
- typing.Literal with multiple values (available since 3.8)
- from __future__ import annotations for deferred evaluation
Consequence: Python 3.9 and earlier are not supported.