Lexicon storage
thaiphon-data-volubilis ships 84 k Thai words, each pre-derived into a PhonologicalWord (syllable-segmented, with onsets, vowels, codas, and tones resolved). This page explains how that data is stored and served, and why the design holds up under the deployment patterns that matter for a web-facing package.
What's in the wheel
The data package contains a single file: lexicon.db, a read-only SQLite database. Its schema is a single table:
thai_word is the primary key, so SQLite's B-tree index on it gives O(log n) lookup. payload is the serialized PhonologicalWord for that entry. The WITHOUT ROWID declaration removes the hidden integer rowid column, keeping the index tight and the table scan order consistent with key order.
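As a rough illustration of the layout described above, the following sketch builds a table of the same shape in memory. The table name `lexicon`, the column types, and the sample payloads are assumptions made for the example; only the `thai_word` and `payload` columns, the primary key, and `WITHOUT ROWID` come from the text.

```python
import sqlite3

# Hypothetical table name and column types; the real schema may differ.
SCHEMA = """
CREATE TABLE lexicon (
    thai_word TEXT PRIMARY KEY,
    payload   BLOB NOT NULL
) WITHOUT ROWID
"""

conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
# Insert out of key order on purpose.
conn.executemany(
    "INSERT INTO lexicon (thai_word, payload) VALUES (?, ?)",
    [("น้ำ", b"payload-2"), ("กา", b"payload-1")],
)

# Point lookup by primary key: a B-tree search, not a table scan.
row = conn.execute(
    "SELECT payload FROM lexicon WHERE thai_word = ?", ("กา",)
).fetchone()

# A full scan of a WITHOUT ROWID table walks the primary-key B-tree,
# so rows come back in key order rather than insertion order.
scan_order = [word for (word,) in conn.execute("SELECT thai_word FROM lexicon")]
```

Here `row[0]` holds the stored bytes for the requested word, and `scan_order` lists the keys sorted by their binary (UTF-8) ordering.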
The package exposes one public name:
ENTRIES is a Mapping[str, PhonologicalWord] with the usual dict-like interface — __getitem__, __contains__, get, keys, items, values, __len__, __iter__. Callers don't see the storage backend.
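One way such an interface can be layered over SQLite is to subclass `collections.abc.Mapping`, which derives `__contains__`, `get`, `keys`, `items`, and `values` from three methods. The sketch below is not the package's implementation: the class name `SqliteMapping`, the table name `lexicon`, and the JSON payload encoding are all assumptions for illustration.

```python
import json
import sqlite3
from collections.abc import Iterator, Mapping


class SqliteMapping(Mapping):
    """Read-only dict-like view over a key/payload table (hypothetical)."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self._conn = conn

    def __getitem__(self, key: str) -> dict:
        row = self._conn.execute(
            "SELECT payload FROM lexicon WHERE thai_word = ?", (key,)
        ).fetchone()
        if row is None:
            raise KeyError(key)
        # JSON is a stand-in; the real serialization format is not stated.
        return json.loads(row[0])

    def __iter__(self) -> Iterator[str]:
        for (word,) in self._conn.execute("SELECT thai_word FROM lexicon"):
            yield word

    def __len__(self) -> int:
        return self._conn.execute("SELECT COUNT(*) FROM lexicon").fetchone()[0]


# Tiny in-memory database so the sketch runs on its own.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE lexicon (thai_word TEXT PRIMARY KEY, payload TEXT) WITHOUT ROWID"
)
conn.execute("INSERT INTO lexicon VALUES (?, ?)", ("กา", json.dumps({"syllables": 1})))

entries = SqliteMapping(conn)
```

With this in place, `"กา" in entries`, `entries.get("missing")`, and `len(entries)` all work without the caller touching SQL.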
How connections are managed
The SQLite connection is opened with immutable=1 and mode=ro:
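```python
import sqlite3

# Open the database read-only and immutable via a URI filename.
conn = sqlite3.connect(
    "file:/path/to/lexicon.db?immutable=1&mode=ro",
    uri=True,
    check_same_thread=False,
)
# Allow SQLite to memory-map up to 256 MiB of the file.
conn.execute("PRAGMA mmap_size=268435456")
```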
immutable=1 tells SQLite that the file will never be modified by any process, so it can skip locking and change detection. The mmap_size pragma (256 MiB here, well above the size of the file) lets SQLite memory-map the database rather than copy pages into its own buffer pool. The OS page cache then holds the data, not SQLite's private allocation.
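By default, Python's sqlite3 module ties a connection to the thread that created it, and a connection shared between threads would need external serialization. The package avoids sharing altogether: it uses threading.local() to give each thread its own connection. The connection opens on first use in that thread, not at import time.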
A functools.lru_cache(maxsize=10_000) sits above the per-key fetch so that repeated lookups of the same word within a process skip the deserialization step after the first call.
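The combination of a lazily opened per-thread connection and an LRU over the per-key fetch can be sketched as follows. This is a minimal illustration under stated assumptions, not the package's code: the helper names `_connection` and `fetch`, the table name `lexicon`, and the throwaway database are hypothetical, and the payload is returned as raw bytes rather than a deserialized PhonologicalWord.

```python
import functools
import sqlite3
import tempfile
import threading
from pathlib import Path
from typing import Optional

# Build a throwaway database file so the sketch runs on its own.
_db_path = Path(tempfile.mkdtemp()) / "lexicon.db"
_setup = sqlite3.connect(_db_path)
_setup.execute(
    "CREATE TABLE lexicon (thai_word TEXT PRIMARY KEY, payload BLOB) WITHOUT ROWID"
)
_setup.execute("INSERT INTO lexicon VALUES (?, ?)", ("กา", b"raw-bytes"))
_setup.commit()
_setup.close()

_local = threading.local()


def _connection() -> sqlite3.Connection:
    """Return this thread's connection, opening it on first use."""
    conn = getattr(_local, "conn", None)
    if conn is None:
        conn = sqlite3.connect(
            f"{_db_path.as_uri()}?immutable=1&mode=ro", uri=True
        )
        _local.conn = conn
    return conn


@functools.lru_cache(maxsize=10_000)
def fetch(word: str) -> Optional[bytes]:
    """Fetch one payload; repeat calls for the same word are served from the cache."""
    row = _connection().execute(
        "SELECT payload FROM lexicon WHERE thai_word = ?", (word,)
    ).fetchone()
    return row[0] if row else None


first = fetch("กา")   # goes to SQLite
second = fetch("กา")  # served from the LRU
```

Note that `lru_cache` in this sketch also remembers misses (`None`), which is harmless for an immutable database. The cache is process-wide, while the connection is per thread.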
Memory numbers
Measured on a development machine (macOS, single process):
| Measurement | Engine alone | Engine + data package |
|---|---|---|
| RSS after first transcribe_sentence | ~10 MiB | ~19 MiB |
| RSS after 1 k lookups | ~10 MiB | ~47 MiB |
| Median lookup latency (lexicon hit) | — | ~13 µs |
| thaiphon-data-volubilis wheel | — | 2.7 MiB |
| lexicon.db on disk | — | ~17.6 MiB |
The ~30 MiB of additional RSS after warmup (from about 19 MiB to about 47 MiB) is the per-process LRU of inflated entries, which is capped at the 10 k most recently used. The database itself stays in mmap'd pages that the kernel counts once, not once per Python worker.
Multi-worker web servers
Because the database is memory-mapped through the OS page cache, every process that has imported the package shares the same physical pages for lexicon.db. The lexicon does not multiply with worker count.
A concrete example with gunicorn:
Each worker holds its own per-process Python objects — interpreter, your application, and the LRU of recently fetched entries. That comes to roughly 50 MiB per worker, so eight workers add up to about 400 MiB of Python heap. The ~25 MiB of database pages are counted once by the kernel and shared. Total host footprint is roughly 425 MiB, not 8 × 350 MiB.
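The arithmetic behind that estimate is simple enough to write down. The sketch below uses the rounded figures quoted above (50 MiB of private heap per worker, 25 MiB of shared database pages); the function name is hypothetical, and real numbers will vary with the application.

```python
PER_WORKER_MIB = 50  # private Python heap per worker (figure from the text)
SHARED_DB_MIB = 25   # mmap'd lexicon pages, counted once by the kernel


def host_footprint_mib(workers: int) -> int:
    """Estimate total host memory: private heaps scale, shared pages do not."""
    return workers * PER_WORKER_MIB + SHARED_DB_MIB


one_worker = host_footprint_mib(1)     # 50 + 25
eight_workers = host_footprint_mib(8)  # 8 * 50 + 25
```

Only the per-worker term grows with the worker count; the shared term stays constant, which gives 425 MiB for eight workers.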
The same applies to uvicorn with --workers N, to pytest-xdist with -n N, and to any other multi-process setup that reads from the same lexicon.db file.
Thread safety
Each thread that calls ENTRIES[key] lazily opens its own SQLite connection via threading.local(). Nothing crosses a thread boundary. FastAPI and Starlette's threadpool, Django's async views, and explicit threading.Thread usage all work without extra configuration. There are no locks or mutexes in the lookup path once a thread's connection is open.
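The per-thread isolation can be observed with a small experiment. In this sketch an in-memory database stands in for the read-only lexicon.db connection, and `_connection` and `worker` are hypothetical names; the point is only that `threading.local()` hands each thread a distinct connection object and reuses it within that thread.

```python
import sqlite3
import threading
from concurrent.futures import ThreadPoolExecutor

_local = threading.local()


def _connection() -> sqlite3.Connection:
    """Return the calling thread's own connection, creating it lazily."""
    conn = getattr(_local, "conn", None)
    if conn is None:
        conn = sqlite3.connect(":memory:")  # stand-in for the lexicon connection
        _local.conn = conn
    return conn


N = 4
barrier = threading.Barrier(N)


def worker(_: int) -> tuple:
    # Wait until all N tasks are running, so each one lands on its own thread.
    barrier.wait()
    first = _connection()
    second = _connection()
    assert first is second  # the same thread always gets the same connection
    return threading.get_ident(), id(first)


with ThreadPoolExecutor(max_workers=N) as pool:
    results = list(pool.map(worker, range(N)))

thread_ids = {tid for tid, _ in results}
connection_ids = {cid for _, cid in results}
```

Each of the four pool threads ends up with its own connection, and no connection object is ever touched by a second thread.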
Fork safety
The SQLite connection lives in threading.local(), so it is never inherited across a fork(). A child process that fork()'d from a parent that had already done lookups starts with no connection and opens its own on first use. Pre-forking gunicorn and uWSGI work correctly without any post-fork hooks.
Serverless and container cold starts
The package adds almost nothing to import time: no eager inflation of the lexicon, no large .pyc to unpack. The first lookup in a given container instance pays the SQLite open plus a single index-page read — typically single-digit milliseconds. Subsequent lookups within that instance hit the LRU.
Both packages fit inside a Lambda layer. The data package's 2.7 MiB wheel and 17.6 MiB lexicon.db file leave ample room in the 512 MiB Lambda memory tier, even after your own application imports.
The engine's dependency surface is unchanged
thaiphon itself imports nothing related to SQLite. sqlite3 is part of the CPython standard library and is available in every standard Python distribution, so the data package picks up no third-party runtime dependency either. The engine's own guarantee — zero runtime dependencies — continues to hold when both packages are installed.
Further reading
- Design constraints — the broader principles that shaped the engine's architecture.
- Install — installing both packages and verifying the setup.
- Accuracy — what the lexicon contributes to transliteration quality.