Accuracy & open problems¶
thaiphon achieves ~75% exact-match accuracy on a word-level benchmark against independent Wiktionary IPA data when installed with the optional thaiphon-data-volubilis lexicon package. On the base engine alone the figure is ~57%. This page explains what those numbers mean, how they were measured, how to reproduce them, and where the remaining 25% goes wrong.
Headline result¶
The 18 percentage-point jump comes almost entirely from the thaiphon-data-volubilis lexicon package, which supplies word-boundary information for Thai compound words. Without it, the engine mis-segments many multi-syllable compounds and assigns incorrect phonology to each fragment. See The lexicon data package below.
Benchmark corpora¶
Accuracy is measured against two independent, publicly-licensed Thai IPA corpora:
| Corpus | Source | License | Entries | Scheme | Base engine | With thaiphon-data-volubilis |
|---|---|---|---|---|---|---|
| kaikki.org Thai Wiktionary | kaikki.org/dictionary/rawdata.html | CC-BY-SA 4.0 | 17,014 | ipa | ~57% | ~75% |
| PyThaiNLP G2P (Wiktionary) | github.com/PyThaiNLP/pythainlp | CC0 | 15,782 | ipa | — | ~73% |
The two-point gap between the corpora reflects slightly different word-list composition and editorial conventions, not a methodological difference. They cross-validate each other: both confirm the engine is in the low-to-mid 70s range on modern, independently assembled Thai IPA data.
Methodology¶
What "exact match" means¶
thaiphon's ipa scheme output for a word is compared to the Wiktionary IPA after applying a set of notational normalisations that collapse equivalent surface representations of the same phonological analysis:
- Enclosing / / or [ ] brackets are stripped.
- IPA stress marks (ˈ ˌ) are removed, since they are not phonemically distinctive in Thai.
- The combining tie-bar in affricates (t͡ɕ → tɕ) is removed; both notations appear in Wiktionary.
- Centring-diphthong length variants are unified (iːə, iːə̯, iə̯ → iə, and similarly for ɯ and u centrings).
- Offglide diacritics are normalised (i̯ → j, u̯ → w).
- An implicit glottal-stop onset ʔ at word start and after syllable boundaries is stripped; some Wiktionary editors write it explicitly, others omit it.
- Released and unreleased stop codas are treated as equivalent (p = p̚, t = t̚, k = k̚).
After normalisation, the comparison is character-by-character. Tone contours must match (same Chao tone letters in the same position). Vowel length must match. Syllable boundaries must match. Any remaining difference counts as a mismatch.
This is a strict criterion. Many mismatches represent legitimate pronunciation variation rather than clear errors. For pedagogical purposes — where a close-but-not-identical transcription may still be useful — the effective accuracy is higher.
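The normalisation rules listed above can be sketched in a few lines of Python. This is an illustrative reimplementation, not thaiphon's own code: the function name `normalise_ipa`, the order of the steps, and the assumption that syllables are separated by `.` are all choices made for this example.

```python
import re

TIE_BARS = "\u0361\u035c"   # combining tie-bars (above and below)
NON_SYLLABIC = "\u032f"     # combining inverted breve below, as in i̯
UNRELEASED = "\u031a"       # combining left angle above, as in p̚


def normalise_ipa(ipa: str) -> str:
    """Collapse notational variants before an exact-match comparison (sketch)."""
    s = ipa.strip()
    # 1. Strip enclosing / / or [ ] brackets.
    if len(s) >= 2 and (s[0], s[-1]) in {("/", "/"), ("[", "]")}:
        s = s[1:-1]
    # 2. Remove primary and secondary stress marks.
    s = s.replace("\u02c8", "").replace("\u02cc", "")
    # 3. Remove tie-bars in affricates.
    s = re.sub(f"[{TIE_BARS}]", "", s)
    # 4. Unify centring diphthongs: iːə, iːə̯, iə̯ -> iə (same for ɯ and u).
    s = re.sub(f"([iɯu])ː?ə{NON_SYLLABIC}?", r"\1ə", s)
    # 5. Rewrite offglide diacritics as glides.
    s = s.replace("i" + NON_SYLLABIC, "j").replace("u" + NON_SYLLABIC, "w")
    # 6. Drop a glottal-stop onset at word start or after a syllable boundary.
    s = re.sub(r"(^|\.)ʔ", r"\1", s)
    # 7. Treat unreleased stop codas as plain stops.
    s = re.sub(f"([ptk]){UNRELEASED}", r"\1", s)
    return s


def exact_match(engine_ipa: str, reference_ipa: str) -> bool:
    """Character-by-character comparison after normalisation."""
    return normalise_ipa(engine_ipa) == normalise_ipa(reference_ipa)
```

Tone letters, vowel length marks, and syllable boundaries are deliberately left untouched, so any difference in those still counts as a mismatch.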
Mismatch classification¶
When two IPA strings disagree, the comparison classifies the primary dimension of difference:
- Syllable count — the engine and Wiktionary parse the word into a different number of syllables. This is the largest single bucket (~10% of words), driven by compound segmentation, abbreviations, and Sanskrit/Pali words with consonant clusters.
- Mixed — differences across more than one dimension simultaneously (~5.6% of words). Common in foreign loanwords, where the segmentation, vowel quality, and coda may all differ from the reference at once.
- Vowel length — onset and coda agree, but one transcription marks the vowel as long and the other as short (~4.3% of words). Concentrated in English loanwords, where the Thai spelling is sometimes interpreted as representing the source-language vowel duration.
- Tone — all phonemes agree, but the Chao tone contour differs (~3.6% of words). Occurs in foreign-derived words, fossilised compounds, and cases where the engine and the Wiktionary editor make different assumptions about the sandhi environment.
- Coda — segmentation and vowel agree, but the final consonant differs (~0.7% of words). Driven by loanword final-coda variation (see Open problems).
- Onset — everything else agrees, but the initial consonant cluster differs (~0.4% of words). Core cluster parsing is largely correct; the residual involves a handful of orthographically unusual clusters.
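A rough sketch of how such a classifier could work is shown below. It is not the benchmark's actual code. It assumes both strings are already normalised, that syllables are separated by `.`, and that a simplified vowel inventory is enough to split each syllable into onset, nucleus, coda, and tone. The source does not name a bucket for vowel-quality-only differences, so this sketch files them under "mixed" as an assumption.

```python
import re

TONE_LETTERS = "˥˦˧˨˩"
VOWELS = "aeiouɛɔɯɤə"  # simplified inventory, assumed for this sketch
SYLLABLE = re.compile(
    rf"^(?P<onset>[^{VOWELS}]*)(?P<nucleus>[{VOWELS}ː]+)"
    rf"(?P<coda>[^{TONE_LETTERS}]*)(?P<tone>[{TONE_LETTERS}]*)$"
)
# Dimensions that map onto a single-dimension bucket.
BUCKETS = {"onset": "onset", "length": "vowel_length", "coda": "coda", "tone": "tone"}


def parts(syllable: str):
    """Split one syllable into comparable dimensions, or None if unparseable."""
    m = SYLLABLE.match(syllable)
    if m is None:
        return None
    nucleus = m["nucleus"]
    return {
        "onset": m["onset"],
        "vowel": nucleus.replace("ː", ""),
        "length": "ː" in nucleus,
        "coda": m["coda"],
        "tone": m["tone"],
    }


def classify(engine: str, reference: str) -> str:
    """Return 'match' or the primary dimension of difference."""
    if engine == reference:
        return "match"
    e_syls, r_syls = engine.split("."), reference.split(".")
    if len(e_syls) != len(r_syls):
        return "syllable_count"
    differing = set()
    for e, r in zip(e_syls, r_syls):
        pe, pr = parts(e), parts(r)
        if pe is None or pr is None:
            return "mixed"
        differing |= {key for key in pe if pe[key] != pr[key]}
    if len(differing) == 1:
        (dim,) = differing
        if dim in BUCKETS:
            return BUCKETS[dim]
    return "mixed"
```

For instance, `classify("sa˥˩.ma˦˥", "som˥˩")` lands in the syllable-count bucket because the two strings split into different numbers of syllables.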
What happens without the lexicon?¶
Failures are silent
The engine always returns a transliteration. There is no error flag, no confidence score, and no indication at the output layer of which syllables came from the lexicon versus derivation from rules alone. You cannot filter out unreliable results after the fact — the only way to know which outputs are trustworthy is to have the lexicon installed in the first place.
Without thaiphon-data-volubilis, two classes of error are especially common:
Closed-syllable ambiguity. ส้ม ("orange") comes out as /sa˥˩.ma˦˥/ — two syllables with an inserted /a/ — when the correct reading is /som˥˩/, one closed syllable with an inherent /o/. The phonological rules alone cannot choose between those two parses; the lexicon resolves the ambiguity.
Sanskrit-compound insertions. มหาวิทยาลัย ("university") loses the /tʰa/ insertion between /wit̚/ and /jaː/ that educated speech expects. The base engine emits /wit̚.jaː.laj/; the correct spoken form is /wit̚.tʰa.jaː.laj/. Words with Sanskrit-derived internal structure rely on the lexicon to supply their learned readings.
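Because the engine gives no signal when it runs without the lexicon, a caller may want to check for the data package up front. The sketch below uses only the standard library and the distribution name from the install command; the helper name `lexicon_installed` is made up for this example. It confirms only that the package is installed, not that the engine has loaded it.

```python
from importlib import metadata


def lexicon_installed(dist_name: str = "thaiphon-data-volubilis") -> bool:
    """Return True if the named distribution is installed in this environment."""
    try:
        metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return False
    return True


if not lexicon_installed():
    # Fail loudly here instead of printing if base-engine output is unacceptable.
    print("warning: thaiphon-data-volubilis not found; expect base-engine accuracy")
```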
The lexicon data package¶
thaiphon-data-volubilis ships a ~35,000-entry Thai lexicon derived from the VOLUBILIS Mundo Dictionary (CC-BY-SA 4.0). The engine detects it automatically on import.
What it adds:
- Word-boundary segmentation for Thai compound words. Thai script has no spaces between words, so the engine must decide where one word ends and the next begins. Without a lexicon, the engine uses a longest-match heuristic that frequently mis-segments. With the lexicon, it can identify known compound words and segment them correctly.
- Variant and register coverage for vocabulary where the standard phonological rules alone do not determine the correct output — including lexically irregular forms, loanword coda policies, and Sanskrit/Pali learned readings.
Why it accounts for most of the 57% → 75% jump:
The base engine's syllabification heuristics work well for simple, common words, but multi-syllable compounds are extremely common in Thai and the heuristic regularly mis-divides them. A single mis-segmented compound can generate two or more mismatch records in the benchmark. The lexicon eliminates most of this systematic error in one step.
Install:
pip install thaiphon-data-volubilis
The data package carries its own CC-BY-SA 4.0 license, separate from the engine's Apache-2.0 license. Installing it does not affect the licensing of your own code.
How to reproduce the numbers yourself¶
The repository ships a 2,500-entry random sample (seed 20260421, CC-BY-SA 4.0) drawn from the kaikki.org Thai Wiktionary dump as a bundled pytest fixture. This sample runs in seconds and requires no external download.
Bundled sample (recommended starting point)¶
# Install both packages.
pip install thaiphon thaiphon-data-volubilis
# Run the bundled 2,500-entry sample.
pytest tests/etalon/test_wiktionary_ipa_sample.py -v
# Floor: 72%. Measured: ~74% on the sample.
The floor is set at 72%, roughly 2 percentage points below the ~74% measured on the sample, to give headroom for sampling variance while still catching any real regression.
If thaiphon-data-volubilis is not installed, the test skips automatically with a message pointing at the install command. make test always finishes cleanly whatever your setup.
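To see how much headroom the 72% floor leaves, the sampling variance of the 2,500-entry sample can be estimated with the usual binomial standard error. This is a back-of-the-envelope sketch, not part of the test suite; it ignores the finite-population correction for sampling from 17,014 entries, which would shrink the error slightly.

```python
import math


def floor_margin(p: float, n: int, floor: float) -> tuple[float, float]:
    """Return (standard error, distance from p down to floor in SE units)."""
    se = math.sqrt(p * (1.0 - p) / n)
    return se, (p - floor) / se


# Measured sample accuracy ~74%, sample size 2,500, floor 72%.
se, z = floor_margin(p=0.74, n=2500, floor=0.72)
print(f"standard error = {se:.4f}; floor is {z:.1f} SE below the measured rate")
```

The standard error comes out just under 0.9 percentage points, so the floor sits a little more than two standard errors below the measured rate. A resample is unlikely to trip it by chance, but a drop of a few points will.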
Full corpus (complete measurement)¶
To measure against all 17,014 entries, download the kaikki.org dump and point the test at it:
# Download the Thai JSONL file from:
# https://kaikki.org/dictionary/rawdata.html
# The Thai entry file is approximately 43 MB.
# Option A — place at the default cache location:
mkdir -p ~/.cache/thaiphon
mv kaikki-thai.jsonl ~/.cache/thaiphon/
# Option B — set an environment variable:
export THAIPHON_KAIKKI=/path/to/kaikki-thai.jsonl
# Run the full test:
pytest tests/etalon/test_wiktionary_ipa_full.py -v
# Floor: 73%. Measured: ~75% on the full corpus.
See tests/README.md for further detail and tests/fixtures/README.md for fixture licensing and sampling parameters.
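For readers who want to inspect the dump outside pytest, the sketch below shows one way to locate the file and pull out word/IPA pairs. It assumes the environment variable takes precedence over the cache location, and it relies on the kaikki.org JSONL layout, where each line is a JSON object with a `word` field and an optional `sounds` list whose items may carry an `ipa` key. The function names are illustrative, not part of thaiphon.

```python
import json
import os
from pathlib import Path
from typing import Iterator

DEFAULT_DUMP = Path.home() / ".cache" / "thaiphon" / "kaikki-thai.jsonl"


def resolve_dump_path() -> Path:
    """Prefer THAIPHON_KAIKKI if set, else fall back to the cache location."""
    override = os.environ.get("THAIPHON_KAIKKI")
    return Path(override) if override else DEFAULT_DUMP


def iter_word_ipa(path: Path) -> Iterator[tuple[str, str]]:
    """Yield (word, ipa) for every IPA transcription found in the dump."""
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            if not line.strip():
                continue  # tolerate blank lines
            entry = json.loads(line)
            for sound in entry.get("sounds", []):
                ipa = sound.get("ipa")
                if ipa:
                    yield entry["word"], ipa
```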
Open problems¶
The ~25% of words that still mismatch fall into six categories. Some represent genuine limitations; others reflect annotation choices in the Wiktionary data itself.
1. Sanskrit and Pali citation forms¶
A large fraction of Thai vocabulary is borrowed from Sanskrit or Pali. These words often have two legitimate pronunciations: a full citation form used in scholarly or religious contexts, and a reduced everyday form used in ordinary speech. For example, words ending in อิ (such as ภูมิ, ปกติ, ธรรมชาติ) may have the final vowel fully pronounced in the citation form but reduced or absent in casual speech.
Wiktionary entries record whichever form the contributing editor had in mind, and this is not always consistent across entries. The engine's learned_full reading profile targets citation forms; the default everyday profile targets colloquial speech. Mismatches arise when the engine's profile choice does not match Wiktionary's.
This is the largest single identifiable category of systematic mismatch.
2. Compound segmentation and syllabification ambiguity¶
Some Thai word strings can be legitimately divided at more than one boundary. The ranking algorithm picks one segmentation, which may not match the Wiktionary editor's choice. This is especially common in:
- Sanskrit/Pali-derived compounds where consonant clusters can be parsed in multiple ways.
- Abbreviations and acronyms where the spoken form does not correspond directly to the written characters.
- Long compound words where the correct division depends on morphological knowledge the engine does not have.
Segmentation errors feed the syllable-count bucket, which is the largest single bucket in the mismatch classification (~10% of words).
3. Loanword coda variation¶
English, French, and other foreign loanwords introduce final consonants that standard Thai phonotactics does not permit — particularly final /f/, /s/, and /l/. Speakers handle these differently depending on register, word familiarity, and individual practice. thaiphon takes a lexicon-driven approach: known loanwords are tagged with a per-word, per-profile preservation policy. Words outside the lexicon fall back to a heuristic.
Wiktionary records a single transcription per entry; if it records the collapsed native form for a word where the engine preserves the foreign coda (or vice versa), that counts as a mismatch. Examples: กราฟ (graph), ลิฟต์ (lift/elevator), กอล์ฟ (golf), เคส (case).
The etalon_compat reading profile collapses all foreign codas to their native-Thai equivalents and is calibrated to match dictionary-citation style.
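The idea of a per-word, per-profile coda policy can be illustrated with a toy table. Everything here is hypothetical: the table contents, the function name, and the fallback of collapsing unknown words are assumptions for the example and do not describe thaiphon's lexicon format or heuristic. The collapse mapping itself follows standard Thai phonotactics, where final /f/, /s/, and /l/ surface as /p/, /t/, and /n/.

```python
# Native-Thai realisations of the foreign codas named above.
NATIVE_CODA = {"f": "p", "s": "t", "l": "n"}

# Hypothetical policy table: (word, profile) -> keep the foreign coda?
PRESERVE = {
    ("กราฟ", "everyday"): True,
    ("กราฟ", "etalon_compat"): False,
}


def realise_coda(word: str, profile: str, coda: str, policy=PRESERVE) -> str:
    """Return the coda to emit for this word under this reading profile."""
    if policy.get((word, profile), False):
        return coda  # lexicon says: keep the foreign coda
    # Assumed fallback for this sketch: collapse to the native equivalent.
    return NATIVE_CODA.get(coda, coda)
```

With this table, `realise_coda("กราฟ", "everyday", "f")` keeps `f`, while the `etalon_compat` profile collapses it to `p`. That is exactly the kind of disagreement that shows up as a coda mismatch against a single-transcription reference.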
4. Vowel length in closed-syllable loanwords¶
English loanwords sometimes use Thai vowel letters in a way that implies a different duration than the source language. Whether to interpret ลิฟต์ as having a short or long vowel is not fully determined by the Thai orthography, and Wiktionary editors and the engine sometimes disagree. This accounts for most of the vowel-length mismatch bucket.
5. Tone assignment in foreign-derived and fossilised words¶
For native Thai vocabulary, the five tones are determined algorithmically from consonant class, vowel length, syllable type, and tone marks. For foreign-derived words and older fossilised compounds, this algorithm may not match how the word is actually pronounced. Wiktionary may record the attested spoken tone; the engine may derive a different tone from the orthography.
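The orthographic tone rules referred to here are the standard textbook ones, and can be written as a small lookup. The sketch below encodes those textbook rules, not thaiphon's implementation; the function name, argument names, and the string labels for tone marks are choices made for this example.

```python
from typing import Optional


def thai_tone(consonant_class: str, live: bool, long_vowel: bool,
              mark: Optional[str] = None) -> str:
    """Derive the tone from spelling features (standard textbook rules).

    consonant_class: 'mid', 'high' or 'low' (class of the initial consonant)
    live: True for live syllables (long open vowel or sonorant coda)
    long_vowel: only matters for dead syllables with a low-class initial
    mark: None, 'mai_ek', 'mai_tho', 'mai_tri' or 'mai_chattawa'
    """
    if mark == "mai_ek":
        return "falling" if consonant_class == "low" else "low"
    if mark == "mai_tho":
        return "high" if consonant_class == "low" else "falling"
    if mark == "mai_tri":
        return "high"
    if mark == "mai_chattawa":
        return "rising"
    # No tone mark: the tone depends on class and syllable type.
    if live:
        return "rising" if consonant_class == "high" else "mid"
    if consonant_class == "low":
        return "falling" if long_vowel else "high"
    return "low"
```

For example, ส้ม has a high-class initial and mai tho, so the rule gives a falling tone, consistent with the /som˥˩/ reading cited earlier. Loanwords and fossilised compounds are the cases where the attested spoken tone can depart from what this lookup predicts.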
6. Annotation variation in the Wiktionary source¶
Wiktionary is a collaboratively edited resource with many contributors and no single enforcing standard for Thai IPA. Some entries are inconsistent with others, some record non-standard or regional pronunciations, and a small number contain transcription errors. Any disagreement between the engine and a Wiktionary error counts against the accuracy score even if the engine is correct.
Road map¶
Current accuracy work targets the three largest buckets: improving compound segmentation for Sanskrit/Pali-derived vocabulary, refining the lexicon's per-word loanword coda policies, and expanding coverage of the learned_full profile for Indic-derived words. No release dates are committed; progress is incremental and tied to the available corpus evidence.
If you find a word where thaiphon's output is clearly wrong, please open a GitHub issue — real-world examples drive the most useful improvements.
Reporting errors and suggestions¶
Open a GitHub issue at github.com/5w0rdf15h/thaiphon/issues. Include:
- The Thai word.
- What thaiphon produces (with scheme= and profile= arguments).
- What you believe the correct transcription is, and the source (a Wiktionary entry URL, a textbook reference, or an audio recording).
Contributions from linguists, Thai teachers, and native speakers are especially welcome. The open-problem categories above are good places to focus if you want to make a systematic impact.
References¶
- kaikki.org Wiktionary dump: kaikki.org/dictionary/rawdata.html
- PyThaiNLP project: github.com/PyThaiNLP/pythainlp
- Wiktionary Thai IPA conventions: Appendix:Thai pronunciation on the English Wiktionary