Database schema for the LLM-generated English-Chinese dictionary — design, motivations, and rationale
LinguaNaturalis is building an English-Chinese dictionary powered by LLM generation. The source data comes from ECDICT (770K words with frequency rankings, tags, and Chinese translations), which serves as seed data. An LLM pipeline generates higher-quality definitions and translations, which are then normalized into this relational database.
The dictionary is read-heavy — users look up words, not edit them. The database must serve lookups fast under high concurrency on a server. Every design decision optimizes for the hot path: given a word string, return its entries and definitions with zero unnecessary JOINs.
words ──────────▸ definitions (hot path: zero JOINs)
│
├──1:N──▸ word_relationships (cold path: morphological links)
│
├──1:N──▸ word_forms (reverse lookup inventory)
│
├──1:N──▸ word_tags (indexed tag queries)
│
└──1:N──▸ phonetics (pronunciation variants with notes)
Hot path (every lookup): words + definitions — two flat queries, no JOINs.
Cold path (optional): word_relationships, word_forms, word_tags, phonetics — queried only when the UI needs extra info.
When a user searches for a word, here is exactly what happens:
One row per dictionary entry. Flat — no foreign keys to other words. Duplicates allowed for homographs ("lie" recline vs "lie" untruth).
| Column | Type | Constraints | Description |
|---|---|---|---|
| id | INTEGER | PK | Auto-increment primary key |
| word | TEXT | NOT NULL | Surface form. Duplicates allowed for homographs. |
| collins | INTEGER | Collins star rating 1–5 | |
| oxford | INTEGER | Oxford 3000 membership (1 = yes) | |
| bnc | INTEGER | BNC frequency rank (lower = more frequent) | |
| frq | INTEGER | General frequency rank (lower = more frequent) | |
| audio_path | TEXT | Relative path: "a/ac/abstract.mp3" |
IDX word IDX collins IDX bnc IDX frq
Junction table for morphological links. Cold path — only queried when displaying "past participle of X" info.
| Column | Type | Constraints | Description |
|---|---|---|---|
| id | INTEGER | PK | Auto-increment primary key |
| word_id | INTEGER | FK → words.id NOT NULL | The word with the relationship |
| related_word_id | INTEGER | FK → words.id NOT NULL | The related word |
| relation_type | TEXT | NOT NULL | "headword", "inflected_form", "derived_form", "variant" |
| form_type | TEXT | "past_tense", "plural", "comparative"... |
IDX word_id IDX related_word_id UNIQUE word_id + related_word_id + relation_type
One row per tag per word. Indexed on tag for instant queries like "all IELTS words".
| Column | Type | Constraints | Description |
|---|---|---|---|
| id | INTEGER | PK | Auto-increment primary key |
| word_id | INTEGER | FK → words.id NOT NULL | The word |
| tag | TEXT | NOT NULL | "zk", "gk", "ielts", "toefl", "cet4"... |
IDX word_id IDX tag UNIQUE word_id + tag
One row per English definition sense. Translations are columns on the same row — no JOINs for translations.
| Column | Type | Constraints | Description |
|---|---|---|---|
| id | INTEGER | PK | Auto-increment primary key |
| word_id | INTEGER | FK → words.id NOT NULL | The word this definition belongs to |
| sense_freq | INTEGER | NOT NULL | Frequency of this sense (lower = more common). Adjustable. |
| pos | TEXT | Part of speech (noun, verb, adj...) | |
| definition | TEXT | NOT NULL | English definition for this sense |
| translate_cn | TEXT | Simplified Chinese translation | |
| usage_register | TEXT | "formal", "informal", "slang", "literary", "technical" | |
| note | TEXT | Usage notes: "often used in negative contexts" | |
| review_status | TEXT | "unreviewed", "reviewed", "rejected" | |
| source_status | TEXT | "ecdict_seeded", "llm_generated", "human_edited" |
IDX word_id IDX word_id + sense_freq
Future columns: translate_cn_t (Traditional), translate_jp, translate_kr...
Pronunciation variants. One word can have multiple rows — each with a note explaining when this pronunciation applies.
| Column | Type | Constraints | Description |
|---|---|---|---|
| id | INTEGER | PK | Auto-increment primary key |
| word_id | INTEGER | FK → words.id NOT NULL | The word |
| accent | TEXT | Accent code: "br", "us", "au"... NULL for general | |
| phonetic | TEXT | NOT NULL | IPA notation |
| note | TEXT | When this pronunciation applies |
IDX word_id
Inventory of all inflected forms. Used for reverse lookups: user types "went" → find "go".
| Column | Type | Constraints | Description |
|---|---|---|---|
| id | INTEGER | PK | Auto-increment primary key |
| word_id | INTEGER | FK → words.id NOT NULL | The parent word |
| form_type | TEXT | NOT NULL | Type of variation (see below) |
| form_value | TEXT | NOT NULL | The actual word form |
IDX word_id IDX form_value (reverse lookup)
words: id=42, word="break", collins=5 definitions: id=101, word_id=42, sense_freq=10, pos="n.", definition="an unexpected piece of good luck", translate_cn="意外的好运", review_status="reviewed", source_status="ecdict_seeded" id=102, word_id=42, sense_freq=20, pos="n.", definition="the occurrence of breaking", translate_cn="破裂", review_status="reviewed", source_status="ecdict_seeded" id=103, word_id=42, sense_freq=80, pos="v.", definition="separate into parts", translate_cn="打破", review_status="reviewed", source_status="ecdict_seeded" phonetics: word_id=42, accent=NULL, phonetic="breɪk", note="standard" word_forms: word_id=42, form_type="past_tense", form_value="broke" word_id=42, form_type="past_participle", form_value="broken" word_id=42, form_type="present_participle", form_value="breaking" word_tags: word_id=42, tag="zk" word_id=42, tag="gk" word_id=42, tag="ielts"
words: id=100, word="lie" (to recline) id=101, word="lie" (to tell an untruth) definitions (lie #100 — recline): id=601, word_id=100, sense_freq=10, pos="v.", definition="to be in or assume a horizontal position", translate_cn="躺", review_status="reviewed", source_status="ecdict_seeded" definitions (lie #101 — untruth): id=602, word_id=101, sense_freq=10, pos="n.", definition="an intentionally false statement", translate_cn="谎言", review_status="reviewed", source_status="ecdict_seeded" id=603, word_id=101, sense_freq=20, pos="v.", definition="to say something that is not true", translate_cn="撒谎", review_status="reviewed", source_status="ecdict_seeded" word_relationships: word_id=100, related_word_id=102, relation_type="inflected_form", form_type="past_tense" (lie #100's past tense is "lay" (id=102))
words: id=500, word="bound" definitions (standalone meanings): id=701, word_id=500, sense_freq=15, pos="adj", definition="heading to a destination", translate_cn="前往…的", review_status="reviewed", source_status="llm_generated" id=702, word_id=500, sense_freq=40, pos="adj", definition="certain to happen", translate_cn="注定的", review_status="reviewed", source_status="llm_generated" id=703, word_id=500, sense_freq=60, pos="adj", definition="tied or constrained", translate_cn="被绑住的", review_status="reviewed", source_status="ecdict_seeded" word_relationships: word_id=500, related_word_id=48, relation_type="inflected_form", form_type="past_participle" ("bound" is also the past participle of "bind" (id=48))
words: id=888, word="went" (no rows in definitions → reference-only item) word_relationships: word_id=888, related_word_id=89, relation_type="inflected_form", form_type="past_tense" ("went" is the past tense of "go" (id=89)) word_forms (on the parent "go"): word_id=89, form_type="past_tense", form_value="went"
words: id=50, word="present" definitions: id=801, word_id=50, sense_freq=10, pos="n.", definition="a thing given to someone", translate_cn="礼物" id=802, word_id=50, sense_freq=20, pos="adj", definition="being in a particular place", translate_cn="出席的" id=803, word_id=50, sense_freq=50, pos="v.", definition="to hand over or show", translate_cn="呈现" phonetics: word_id=50, accent=NULL, phonetic="/ˈprezənt/", note="noun and adjective forms" word_id=50, accent=NULL, phonetic="/prɪˈzent/", note="verb form" word_id=50, accent="br", phonetic="/ˈprezənt/", note="British, noun/adj" word_id=50, accent="us", phonetic="/ˈpreznt/", note="American, noun/adj"
words: id=200, word="either" phonetics: word_id=200, accent=NULL, phonetic="/ˈiːðər/", note="common variant" word_id=200, accent=NULL, phonetic="/ˈaɪðər/", note="also widely accepted"
Decisions made during the design process, with reasoning.
These are used constantly for sorting and filtering — "show me the most important IELTS words." Keeping them on the words table means a single ORDER BY frq ASC with no JOIN. A junction table would add a JOIN to every frequency query, which hurts under high concurrency.
LIKE '%ielts%' is slow and not indexable. At 770K words, scanning the entire table for every tag query is unacceptable. A separate word_tags table with an index on tag makes "all TOEFL words" instant. The extra table costs ~60MB but saves query time on every tag-based lookup.
An order is static — you'd need a migration to change which definition appears first. A frequency is dynamic — it can be reweighted based on user behavior, corpus statistics, or LLM re-ranking. "The occurrence of breaking" might be more common than "an unexpected piece of good luck," and the data should reflect that.
English has homographs — words with the same spelling but unrelated meanings. "Lie" (recline) and "lie" (untruth) are etymologically different words that happen to share a surface form. They need separate IDs and separate definitions. A UNIQUE constraint on word would force them into one entry, losing the distinction.
The split (terms → entry_terms → entries) is semantically cleaner but forces a JOIN on every lookup. Under high concurrency on a server, even millisecond-level JOIN overhead compounds. The flat words table means the hot path is two simple indexed queries with zero JOINs.
"bound" has its own row in words and its own definitions (standalone). It also has a row in word_relationships pointing to "bind" (inflected_form). The two facts coexist without conflict. The hot path only reads words + definitions; the relationship is fetched optionally on the cold path.
Pronunciation varies by accent, POS, register, and generation — not just one dimension. "either" has two pronunciations as the same POS. "present" differs by POS. Tying phonetic to a single column on words or definitions can't capture this. A separate table with a note column handles all cases: accent ("British"), POS ("verb form"), preference ("common variant"), etc.
A separate sense_translations table requires a JOIN for every definition lookup. With translations as columns on definitions, you get the complete sense (English + Chinese) in a single query. Adding a new language (Japanese, Korean) is just an ALTER TABLE — add a column. At 3M definition rows, the extra NULL columns cost negligible space.
The LLM pipeline generates definitions in stages. source_status tracks origin: ecdict_seeded (migrated from ECDICT), llm_generated (created by LLM), human_edited (reviewed and modified). review_status tracks quality: unreviewed → reviewed or rejected. This lets you ship the dictionary incrementally — ECDICT data first, LLM data as it's validated.
word_forms is an inventory of inflected forms — "go" has went, gone, going, goes. It's used for reverse lookups: user types "went" → find "go." word_relationships tracks the semantic link between two words: "went is the past tense of go." They serve different purposes: one is a lookup index, the other is a relationship declaration.
A morphological form always traces back to exactly one origin. "went" is the past tense of "go" — not of any other word. "interesting" derives from "interest" — not from multiple sources. Even "bound," which is both standalone AND related to "bind," has exactly one morphological parent for that particular relationship.
The direction matters: a surface string can map to multiple dictionary entries (homographs — "lie" has two unrelated entries). But a dictionary entry has at most one morphological parent. This asymmetry is why word_relationships uses two foreign keys (word_id and related_word_id) rather than a many-to-many junction — each word points to at most one origin, not many.
| Table | Est. Rows | Size |
|---|---|---|
| words | ~780K | ~70 MB |
| definitions | ~3M | ~150 MB |
| word_tags | ~2M | ~60 MB |
| phonetics | ~1.5M | ~50 MB |
| word_relationships | ~200K | ~10 MB |
| word_forms | ~100K | ~10 MB |
| Total (with indexes) | ~7.5M | ~360 MB |
Fine for a bundled SQLite DB. Can ship compressed and decompress on first launch.
CREATE TABLE words ( id INTEGER PRIMARY KEY, word TEXT NOT NULL, collins INTEGER, oxford INTEGER, bnc INTEGER, frq INTEGER, audio_path TEXT ); CREATE INDEX idx_words_word ON words(word); CREATE INDEX idx_words_collins ON words(collins); CREATE INDEX idx_words_bnc ON words(bnc); CREATE INDEX idx_words_frq ON words(frq); CREATE TABLE word_relationships ( id INTEGER PRIMARY KEY, word_id INTEGER NOT NULL, related_word_id INTEGER NOT NULL, relation_type TEXT NOT NULL, form_type TEXT, FOREIGN KEY (word_id) REFERENCES words(id), FOREIGN KEY (related_word_id) REFERENCES words(id), UNIQUE (word_id, related_word_id, relation_type) ); CREATE INDEX idx_wr_word ON word_relationships(word_id); CREATE INDEX idx_wr_rel ON word_relationships(related_word_id); CREATE TABLE word_tags ( id INTEGER PRIMARY KEY, word_id INTEGER NOT NULL, tag TEXT NOT NULL, FOREIGN KEY (word_id) REFERENCES words(id), UNIQUE (word_id, tag) ); CREATE INDEX idx_wt_tag ON word_tags(tag); CREATE TABLE definitions ( id INTEGER PRIMARY KEY, word_id INTEGER NOT NULL, sense_freq INTEGER NOT NULL, pos TEXT, definition TEXT NOT NULL, translate_cn TEXT, usage_register TEXT, note TEXT, review_status TEXT, source_status TEXT, FOREIGN KEY (word_id) REFERENCES words(id) ); CREATE INDEX idx_def_word ON definitions(word_id); CREATE TABLE phonetics ( id INTEGER PRIMARY KEY, word_id INTEGER NOT NULL, accent TEXT, phonetic TEXT NOT NULL, note TEXT, FOREIGN KEY (word_id) REFERENCES words(id) ); CREATE INDEX idx_phon_word ON phonetics(word_id); CREATE TABLE word_forms ( id INTEGER PRIMARY KEY, word_id INTEGER NOT NULL, form_type TEXT NOT NULL, form_value TEXT NOT NULL, FOREIGN KEY (word_id) REFERENCES words(id) ); CREATE INDEX idx_wf_word ON word_forms(word_id); CREATE INDEX idx_wf_value ON word_forms(form_value);