Phonological model

The phonological model is thaiphon's universal intermediate representation — the scheme-independent data structure that sits between orthography parsing and rendering.


Design principles

  • Immutable. Every object in the model is a frozen dataclass. Once computed, a PhonologicalWord cannot be changed. This makes the model safe to cache, pass between threads, and use as a dictionary key.
  • IPA-based. All phoneme symbols use IPA notation internally. Schemes translate from IPA to their target notation — they do not invent a private phoneme representation.
  • Slots. All frozen dataclasses use __slots__ for reduced memory overhead.
  • No I/O. The model objects contain no file paths, database references, or network handles.

The class hierarchy

PhonologicalWord
  └── syllables: tuple[Syllable, ...]
        ├── onset:  Phoneme | Cluster | None
        │     ├── Phoneme.symbol: str   (IPA)
        │     └── Cluster.first + .second: Phoneme
        ├── vowel:  Phoneme
        ├── vowel_length: VowelLength   (SHORT | LONG)
        ├── coda:   Phoneme | None
        ├── tone:   Tone               (MID | LOW | FALLING | HIGH | RISING)
        ├── tone_mark: ToneMark        (NONE | MAI_EK | MAI_THO | MAI_TRI | MAI_JATTAWA)
        ├── effective_class: EffectiveClass (HIGH | MID | LOW)
        ├── syllable_type: SyllableType    (LIVE | DEAD)
        ├── raw: str                   (orthographic slice)
        └── inserted_vowel: bool       (True when a vowel was implicitly inserted)

PhonologicalWord

from thaiphon.model.word import PhonologicalWord

| Field | Type | Description |
|---|---|---|
| syllables | tuple[Syllable, ...] | The syllables of the word, in order |
| morpheme_boundaries | tuple[int, ...] | Indices of morpheme boundaries (may be empty) |
| confidence | float | Syllabification confidence score (1.0 = lexicon hit) |
| source | str | "lexicon", "derivation", or "derivation+lexicon" |
| raw | str | The original input string |

PhonologicalWord supports len() and iteration over its syllables:

from thaiphon import analyze

result = analyze("สวัสดี")
word = result.best

len(word)         # 3 (three syllables)
for syl in word:
    print(syl.tone.name)   # LOW, LOW, MID

Syllable

from thaiphon.model.syllable import Syllable

| Field | Type | Default | Description |
|---|---|---|---|
| onset | Phoneme \| Cluster \| None | | Initial consonant(s) |
| vowel | Phoneme | | Nucleus vowel |
| vowel_length | VowelLength | | SHORT or LONG |
| coda | Phoneme \| None | | Final consonant, or None for open syllables |
| tone | Tone | | Derived tone |
| tone_mark | ToneMark | NONE | Written tone mark, if any |
| effective_class | EffectiveClass | MID | Class used for tone lookup (after leading-ห adjustment) |
| syllable_type | SyllableType | LIVE | LIVE or DEAD |
| raw | str | "" | Orthographic slice for this syllable |
| inserted_vowel | bool | False | True when an inherent vowel was inserted |
| notes | tuple[str, ...] | () | Diagnostic notes (for debugging) |

Phoneme and Cluster

from thaiphon.model.phoneme import Phoneme, Cluster

Phoneme is a single IPA phoneme:

| Field | Type | Description |
|---|---|---|
| symbol | str | IPA symbol (e.g. "kʰ", "aː", "m") |
| is_aspirated | bool | True for aspirated stops |
| is_sonorant | bool | True for sonorants (/m n ŋ j w r l/) |

Cluster is a two-phoneme onset cluster:

| Field | Type | Description |
|---|---|---|
| first | Phoneme | First consonant of the cluster |
| second | Phoneme | Second consonant (typically /r/, /l/, or /w/) |

Enumerations

from thaiphon.model.enums import Tone, VowelLength, SyllableType, ToneMark, EffectiveClass, ConsonantClass

All enumerations are str enums, so each member compares equal to its string value, which matches the member name:

from thaiphon.model.enums import Tone
Tone.MID == "MID"   # True

| Enum | Values |
|---|---|
| Tone | MID, LOW, FALLING, HIGH, RISING |
| VowelLength | SHORT, LONG |
| SyllableType | LIVE, DEAD |
| ToneMark | NONE, MAI_EK, MAI_THO, MAI_TRI, MAI_JATTAWA |
| EffectiveClass | HIGH, MID, LOW |
| ConsonantClass | HIGH, MID, LOW_PAIRED, LOW_SONORANT |
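
The str-enum behaviour can be reproduced with the standard library alone. `DemoTone` below is a hypothetical stand-in. It assumes each member's value is the same string as its name, which is consistent with the `Tone.MID == "MID"` comparison above but is not taken from thaiphon's source:

```python
from enum import Enum


# Hypothetical stand-in; values are assumed to equal the member names.
class DemoTone(str, Enum):
    MID = "MID"
    LOW = "LOW"
    FALLING = "FALLING"
    HIGH = "HIGH"
    RISING = "RISING"


# Members compare equal to plain strings because they are str instances.
equal_to_string = DemoTone.MID == "MID"

# A plain string (for example, one read from JSON) maps back to the member.
from_string = DemoTone("FALLING")

# Every member is also a str.
is_str = isinstance(DemoTone.LOW, str)
```

This makes enum values convenient to serialise and to compare against configuration strings without explicit conversion.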

AnalysisResult

from thaiphon.model.candidate import AnalysisResult

Returned by analyze() and analyze_word():

| Field | Type | Description |
|---|---|---|
| best | PhonologicalWord | Top-ranked phonological word |
| alternatives | tuple[PhonologicalWord, ...] | Lower-ranked candidates (may be empty) |
| source | str | "lexicon" or "derivation" |
| raw | str | The normalised input string |
| loan_analysis | LoanAnalysis \| None | Foreignness detector output (observational only) |

from thaiphon import analyze

result = analyze("น้ำ")
result.best           # PhonologicalWord
result.best.syllables # tuple of Syllable
result.raw            # 'น้ำ'
result.source         # 'lexicon' (found in the lexicon)
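
When every candidate reading is needed rather than only the top one, `best` and `alternatives` can be combined into a single ranked sequence. `all_candidates` below is a hypothetical helper, not part of thaiphon. It relies only on the two documented fields and is demonstrated on a stand-in object so that it runs without the library:

```python
from types import SimpleNamespace


def all_candidates(result):
    """Return the best reading followed by the alternatives, in rank order."""
    return (result.best, *result.alternatives)


# Stand-in for an AnalysisResult; plain strings replace PhonologicalWord objects.
fake_result = SimpleNamespace(best="reading-1", alternatives=("reading-2", "reading-3"))
ranked = all_candidates(fake_result)

# With no alternatives, only the best reading is returned.
only_best = all_candidates(SimpleNamespace(best="reading-1", alternatives=()))
```

Assuming the fields behave as documented, the same helper should work unchanged on the value returned by analyze().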