Phonological model
The phonological model is thaiphon's universal intermediate representation — the scheme-independent data structure that sits between orthography parsing and rendering.
Design principles
- Immutable. Every object in the model is a frozen dataclass. Once computed, a `PhonologicalWord` cannot be changed. This makes the model safe to cache, pass between threads, and use as a dictionary key.
- IPA-based. All phoneme symbols use IPA notation internally. Schemes translate from IPA to their target notation; they do not invent a private phoneme representation.
- Slots. All frozen dataclasses use `__slots__` for reduced memory overhead.
- No I/O. The model objects contain no file paths, database references, or network handles.
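The immutability and slots principles can be illustrated with a small sketch. `DemoPhoneme` below is a hypothetical stand-in, not thaiphon's actual class, and the manual `__slots__` declaration is only one way to get a slotted dataclass (on Python 3.10+, `@dataclass(frozen=True, slots=True)` is a common alternative). How thaiphon itself declares its slots is not shown on this page.

```python
from dataclasses import FrozenInstanceError, dataclass


# Hypothetical stand-in, NOT thaiphon's actual class definition.
@dataclass(frozen=True)
class DemoPhoneme:
    __slots__ = ("symbol",)  # manual slots work when fields have no defaults
    symbol: str              # IPA symbol, e.g. "kʰ"


p = DemoPhoneme("kʰ")

# Frozen: assigning to a field raises FrozenInstanceError.
try:
    p.symbol = "k"
    mutated = True
except FrozenInstanceError:
    mutated = False

# Frozen dataclasses with the default eq=True are hashable, so equal
# instances can serve as dictionary keys or cache keys.
cache = {p: "cached value"}
looked_up = cache[DemoPhoneme("kʰ")]

# Slotted: instances carry no per-instance __dict__.
has_dict = hasattr(p, "__dict__")

print(mutated, looked_up, has_dict)
```

Because equality and hashing are derived from field values, two separately constructed objects with the same contents hit the same cache entry.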
The class hierarchy
PhonologicalWord
└── syllables: tuple[Syllable, ...]
    ├── onset: Phoneme | Cluster | None
    │   ├── Phoneme.symbol: str (IPA)
    │   └── Cluster.first + .second: Phoneme
    ├── vowel: Phoneme
    ├── vowel_length: VowelLength (SHORT | LONG)
    ├── coda: Phoneme | None
    ├── tone: Tone (MID | LOW | FALLING | HIGH | RISING)
    ├── tone_mark: ToneMark (NONE | MAI_EK | MAI_THO | MAI_TRI | MAI_JATTAWA)
    ├── effective_class: EffectiveClass (HIGH | MID | LOW)
    ├── syllable_type: SyllableType (LIVE | DEAD)
    ├── raw: str (orthographic slice)
    └── inserted_vowel: bool (True when a vowel was implicitly inserted)
PhonologicalWord
| Field | Type | Description |
|---|---|---|
| `syllables` | `tuple[Syllable, ...]` | The syllables of the word, in order |
| `morpheme_boundaries` | `tuple[int, ...]` | Indices of morpheme boundaries (may be empty) |
| `confidence` | `float` | Syllabification confidence score (1.0 = lexicon hit) |
| `source` | `str` | `"lexicon"`, `"derivation"`, or `"derivation+lexicon"` |
| `raw` | `str` | The original input string |
PhonologicalWord supports len() and iteration over its syllables:
from thaiphon import analyze  # assumed import path; not confirmed on this page

result = analyze("สวัสดี")
word = result.best
len(word)  # 3 (three syllables)
for syl in word:
    print(syl.tone.name)  # LOW, LOW, MID
Syllable
| Field | Type | Default | Description |
|---|---|---|---|
| `onset` | `Phoneme \| Cluster \| None` | — | Initial consonant(s) |
| `vowel` | `Phoneme` | — | Nucleus vowel |
| `vowel_length` | `VowelLength` | — | `SHORT` or `LONG` |
| `coda` | `Phoneme \| None` | — | Final consonant, or `None` for open syllables |
| `tone` | `Tone` | — | Derived tone |
| `tone_mark` | `ToneMark` | `NONE` | Written tone mark, if any |
| `effective_class` | `EffectiveClass` | `MID` | Class used for tone lookup (after leading-ห adjustment) |
| `syllable_type` | `SyllableType` | `LIVE` | `LIVE` or `DEAD` |
| `raw` | `str` | `""` | Orthographic slice for this syllable |
| `inserted_vowel` | `bool` | `False` | `True` when an inherent vowel was inserted |
| `notes` | `tuple[str, ...]` | `()` | Diagnostic notes (for debugging) |
Phoneme and Cluster
Phoneme is a single IPA phoneme:
| Field | Type | Description |
|---|---|---|
| `symbol` | `str` | IPA symbol (e.g. `"kʰ"`, `"aː"`, `"m"`) |
| `is_aspirated` | `bool` | `True` for aspirated stops |
| `is_sonorant` | `bool` | `True` for sonorants (/m n ŋ j w r l/) |
Cluster is a two-phoneme onset cluster:
| Field | Type | Description |
|---|---|---|
| `first` | `Phoneme` | First consonant of the cluster |
| `second` | `Phoneme` | Second consonant (typically /r/, /l/, or /w/) |
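Because a syllable's `onset` may hold a `Phoneme`, a `Cluster`, or `None`, code that consumes the model has to handle all three cases. The sketch below shows one way to do that. The two classes are stand-ins that mirror the documented fields (the `False` defaults are an assumption made for brevity), and `onset_symbols` is a hypothetical helper, not part of thaiphon.

```python
from dataclasses import dataclass
from typing import Optional, Tuple, Union


# Stand-ins mirroring the documented fields; not thaiphon's own classes.
@dataclass(frozen=True)
class Phoneme:
    symbol: str
    is_aspirated: bool = False  # default assumed for this sketch only
    is_sonorant: bool = False   # default assumed for this sketch only


@dataclass(frozen=True)
class Cluster:
    first: Phoneme
    second: Phoneme


def onset_symbols(onset: Optional[Union[Phoneme, Cluster]]) -> Tuple[str, ...]:
    """Return the IPA symbols of an onset as a tuple (hypothetical helper)."""
    if onset is None:               # no onset recorded
        return ()
    if isinstance(onset, Cluster):  # two-consonant onset
        return (onset.first.symbol, onset.second.symbol)
    return (onset.symbol,)          # single consonant


kl = Cluster(Phoneme("k"), Phoneme("l", is_sonorant=True))
print(onset_symbols(kl))                              # ('k', 'l')
print(onset_symbols(Phoneme("m", is_sonorant=True)))  # ('m',)
print(onset_symbols(None))                            # ()
```

Returning a tuple in every case lets callers join or count the symbols without repeating the type check.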
Enumerations
from thaiphon.model.enums import Tone, VowelLength, SyllableType, ToneMark, EffectiveClass, ConsonantClass
All enumerations are str enums, meaning they compare equal to their string names:
| Enum | Values |
|---|---|
| `Tone` | `MID`, `LOW`, `FALLING`, `HIGH`, `RISING` |
| `VowelLength` | `SHORT`, `LONG` |
| `SyllableType` | `LIVE`, `DEAD` |
| `ToneMark` | `NONE`, `MAI_EK`, `MAI_THO`, `MAI_TRI`, `MAI_JATTAWA` |
| `EffectiveClass` | `HIGH`, `MID`, `LOW` |
| `ConsonantClass` | `HIGH`, `MID`, `LOW_PAIRED`, `LOW_SONORANT` |
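To make the "str enum" behaviour concrete, here is a minimal sketch using a locally defined stand-in. It assumes each member's value is identical to its name, which is what the statement above implies; thaiphon's real definitions may differ in detail.

```python
from enum import Enum


# Stand-in enum; assumes value == name. Not thaiphon's actual definition.
class Tone(str, Enum):
    MID = "MID"
    LOW = "LOW"
    FALLING = "FALLING"
    HIGH = "HIGH"
    RISING = "RISING"


print(Tone.LOW == "LOW")              # True: the str mixin compares by value
print(Tone("RISING") is Tone.RISING)  # True: members can be looked up by value
print(Tone.MID.name)                  # MID
```

With this kind of enum, plain strings read from configuration or test fixtures can be compared against members directly.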
AnalysisResult
Returned by analyze() and analyze_word():
| Field | Type | Description |
|---|---|---|
| `best` | `PhonologicalWord` | Top-ranked phonological word |
| `alternatives` | `tuple[PhonologicalWord, ...]` | Lower-ranked candidates (may be empty) |
| `source` | `str` | `"lexicon"` or `"derivation"` |
| `raw` | `str` | The normalised input string |
| `loan_analysis` | `LoanAnalysis \| None` | Foreignness detector output (observational only) |
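Since `best` and `alternatives` together form a ranked list of candidates, it can be convenient to walk them in one pass. The sketch below is duck-typed and runs without thaiphon installed: `ranked_candidates` is a hypothetical helper, and the `SimpleNamespace` objects are dummies with made-up values that stand in for a real `AnalysisResult`.

```python
from types import SimpleNamespace


def ranked_candidates(result):
    """Return the best word followed by the alternatives (hypothetical helper)."""
    return [result.best, *result.alternatives]


# Dummy object standing in for an AnalysisResult. Only the fields used
# here are provided, and the confidence values are invented for the demo.
fake_result = SimpleNamespace(
    best=SimpleNamespace(confidence=1.0),
    alternatives=(SimpleNamespace(confidence=0.6),),
)

for rank, word in enumerate(ranked_candidates(fake_result), start=1):
    print(rank, word.confidence)
```

With a real result, the same helper should work unchanged, provided the object exposes `best` and `alternatives` as described in the table.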