LinguaNaturalis Dictionary DB

Database schema for the LLM-generated English-Chinese dictionary — design, motivations, and rationale

1 What Is This

LinguaNaturalis is building an English-Chinese dictionary powered by LLM generation. The source data comes from ECDICT (770K words with frequency rankings, tags, and Chinese translations), which serves as seed data. An LLM pipeline generates higher-quality definitions and translations, which are then normalized into this relational database.

The dictionary is read-heavy — users look up words, not edit them. The database must serve lookups fast under high concurrency on a server. Every design decision optimizes for the hot path: given a word string, return its entries and definitions with zero unnecessary JOINs.

2 Schema Overview

  words ──────────▸ definitions        (hot path: zero JOINs)
    │
    ├──1:N──▸ word_relationships       (cold path: morphological links)
    │
    ├──1:N──▸ word_forms               (reverse lookup inventory)
    │
    ├──1:N──▸ word_tags                (indexed tag queries)
    │
    └──1:N──▸ phonetics                (pronunciation variants with notes)

Hot path (every lookup): words + definitions — two flat queries, no JOINs.

Cold path (optional): word_relationships, word_forms, word_tags, phonetics — queried only when the UI needs extra info.

3 Lookup Workflow

When a user searches for a word, here is exactly what happens:

Case A: User types a headword ("break")

Search words table — SELECT * FROM words WHERE word = 'break'. Returns one row (or multiple for homographs like "lie").

Get definitions — SELECT * FROM definitions WHERE word_id IN (...) ORDER BY sense_freq ASC.

Get phonetics — SELECT * FROM phonetics WHERE word_id IN (...). Returns pronunciation variants with notes.

Get tags — SELECT * FROM word_tags WHERE word_id IN (...). Returns exam tags, level tags.

Case B: User types a morphological form ("went")

Search words table — no match. "went" is not a headword.

Fall back to word_forms — SELECT * FROM word_forms WHERE form_value = 'went'. Returns word_id pointing to "go".

Look up "go" as headword — same as Case A from step 1.

Show relationship — "went is the past tense of go".

4 Table Definitions

words

One row per dictionary entry. Flat — no foreign keys to other words. Duplicates allowed for homographs ("lie" recline vs "lie" untruth).

Column	Type	Constraints	Description
id	INTEGER	PK	Auto-increment primary key
word	TEXT	NOT NULL	Surface form. Duplicates allowed for homographs.
collins	INTEGER		Collins star rating 1–5
oxford	INTEGER		Oxford 3000 membership (1 = yes)
bnc	INTEGER		BNC frequency rank (lower = more frequent)
frq	INTEGER		General frequency rank (lower = more frequent)
audio_path	TEXT		Relative path: "a/ac/abstract.mp3"

IDX word IDX collins IDX bnc IDX frq

word_relationships

Junction table for morphological links. Cold path — only queried when displaying "past participle of X" info.

Column	Type	Constraints	Description
id	INTEGER	PK	Auto-increment primary key
word_id	INTEGER	FK → words.id NOT NULL	The word with the relationship
related_word_id	INTEGER	FK → words.id NOT NULL	The related word
relation_type	TEXT	NOT NULL	"headword", "inflected_form", "derived_form", "variant"
form_type	TEXT		"past_tense", "plural", "comparative"...

IDX word_id IDX related_word_id UNIQUE word_id + related_word_id + relation_type

word_tags

One row per tag per word. Indexed on tag for instant queries like "all IELTS words".

Column	Type	Constraints	Description
id	INTEGER	PK	Auto-increment primary key
word_id	INTEGER	FK → words.id NOT NULL	The word
tag	TEXT	NOT NULL	"zk", "gk", "ielts", "toefl", "cet4"...

IDX word_id IDX tag UNIQUE word_id + tag

definitions

One row per English definition sense. Translations are columns on the same row — no JOINs for translations.

Column	Type	Constraints	Description
id	INTEGER	PK	Auto-increment primary key
word_id	INTEGER	FK → words.id NOT NULL	The word this definition belongs to
sense_freq	INTEGER	NOT NULL	Frequency of this sense (lower = more common). Adjustable.
pos	TEXT		Part of speech (noun, verb, adj...)
definition	TEXT	NOT NULL	English definition for this sense
translate_cn	TEXT		Simplified Chinese translation
usage_register	TEXT		"formal", "informal", "slang", "literary", "technical"
note	TEXT		Usage notes: "often used in negative contexts"
review_status	TEXT		"unreviewed", "reviewed", "rejected"
source_status	TEXT		"ecdict_seeded", "llm_generated", "human_edited"

IDX word_id IDX word_id + sense_freq

Future columns: translate_cn_t (Traditional), translate_jp, translate_kr...

phonetics

Pronunciation variants. One word can have multiple rows — each with a note explaining when this pronunciation applies.

Column	Type	Constraints	Description
id	INTEGER	PK	Auto-increment primary key
word_id	INTEGER	FK → words.id NOT NULL	The word
accent	TEXT		Accent code: "br", "us", "au"... NULL for general
phonetic	TEXT	NOT NULL	IPA notation
note	TEXT		When this pronunciation applies

IDX word_id

word_forms

Inventory of all inflected forms. Used for reverse lookups: user types "went" → find "go".

Column	Type	Constraints	Description
id	INTEGER	PK	Auto-increment primary key
word_id	INTEGER	FK → words.id NOT NULL	The parent word
form_type	TEXT	NOT NULL	Type of variation (see below)
form_value	TEXT	NOT NULL	The actual word form

IDX word_id IDX form_value (reverse lookup)

form_type values

past_tense went

past_participle gone

present_participle going

third_person_singular goes

plural cats

comparative bigger

superlative biggest

base_form go

5 Walkthrough Examples

Simple word — "break"

words:
  id=42, word="break", collins=5

definitions:
  id=101, word_id=42, sense_freq=10, pos="n.",  definition="an unexpected piece of good luck",
    translate_cn="意外的好运", review_status="reviewed", source_status="ecdict_seeded"
  id=102, word_id=42, sense_freq=20, pos="n.",  definition="the occurrence of breaking",
    translate_cn="破裂", review_status="reviewed", source_status="ecdict_seeded"
  id=103, word_id=42, sense_freq=80, pos="v.",  definition="separate into parts",
    translate_cn="打破", review_status="reviewed", source_status="ecdict_seeded"

phonetics:
  word_id=42, accent=NULL, phonetic="breɪk", note="standard"

word_forms:
  word_id=42, form_type="past_tense",          form_value="broke"
  word_id=42, form_type="past_participle",     form_value="broken"
  word_id=42, form_type="present_participle",  form_value="breaking"

word_tags:
  word_id=42, tag="zk"
  word_id=42, tag="gk"
  word_id=42, tag="ielts"

Homograph — "lie" (two unrelated words, same spelling)

words:
  id=100, word="lie"    (to recline)
  id=101, word="lie"    (to tell an untruth)

definitions (lie #100 — recline):
  id=601, word_id=100, sense_freq=10, pos="v.",
    definition="to be in or assume a horizontal position",
    translate_cn="躺", review_status="reviewed", source_status="ecdict_seeded"

definitions (lie #101 — untruth):
  id=602, word_id=101, sense_freq=10, pos="n.",
    definition="an intentionally false statement",
    translate_cn="谎言", review_status="reviewed", source_status="ecdict_seeded"
  id=603, word_id=101, sense_freq=20, pos="v.",
    definition="to say something that is not true",
    translate_cn="撒谎", review_status="reviewed", source_status="ecdict_seeded"

word_relationships:
  word_id=100, related_word_id=102, relation_type="inflected_form", form_type="past_tense"
  (lie #100's past tense is "lay" (id=102))

Multi-role word — "bound" (standalone + past participle of "bind")

words:
  id=500, word="bound"

definitions (standalone meanings):
  id=701, word_id=500, sense_freq=15, pos="adj", definition="heading to a destination",
    translate_cn="前往…的", review_status="reviewed", source_status="llm_generated"
  id=702, word_id=500, sense_freq=40, pos="adj", definition="certain to happen",
    translate_cn="注定的", review_status="reviewed", source_status="llm_generated"
  id=703, word_id=500, sense_freq=60, pos="adj", definition="tied or constrained",
    translate_cn="被绑住的", review_status="reviewed", source_status="ecdict_seeded"

word_relationships:
  word_id=500, related_word_id=48, relation_type="inflected_form", form_type="past_participle"
  ("bound" is also the past participle of "bind" (id=48))

Reference only — "went"

words:
  id=888, word="went"
  (no rows in definitions → reference-only item)

word_relationships:
  word_id=888, related_word_id=89, relation_type="inflected_form", form_type="past_tense"
  ("went" is the past tense of "go" (id=89))

word_forms (on the parent "go"):
  word_id=89, form_type="past_tense", form_value="went"

Pronunciation variants — "present" (different POS, different pronunciation)

words:
  id=50, word="present"

definitions:
  id=801, word_id=50, sense_freq=10, pos="n.",  definition="a thing given to someone",
    translate_cn="礼物"
  id=802, word_id=50, sense_freq=20, pos="adj", definition="being in a particular place",
    translate_cn="出席的"
  id=803, word_id=50, sense_freq=50, pos="v.",  definition="to hand over or show",
    translate_cn="呈现"

phonetics:
  word_id=50, accent=NULL, phonetic="/ˈprezənt/", note="noun and adjective forms"
  word_id=50, accent=NULL, phonetic="/prɪˈzent/", note="verb form"
  word_id=50, accent="br", phonetic="/ˈprezənt/", note="British, noun/adj"
  word_id=50, accent="us", phonetic="/ˈpreznt/",  note="American, noun/adj"

Variant pronunciation — "either" (same POS, two preferences)

words:
  id=200, word="either"

phonetics:
  word_id=200, accent=NULL, phonetic="/ˈiːðər/", note="common variant"
  word_id=200, accent=NULL, phonetic="/ˈaɪðər/", note="also widely accepted"

6 Design Q&A

Decisions made during the design process, with reasoning.

Why keep frequency rankings on the words table?

These are used constantly for sorting and filtering — "show me the most important IELTS words." Keeping them on the words table means a single ORDER BY frq ASC with no JOIN. A junction table would add a JOIN to every frequency query, which hurts under high concurrency.

Why not store tags as a space-separated string?

LIKE '%ielts%' is slow and not indexable. At 770K words, scanning the entire table for every tag query is unacceptable. A separate word_tags table with an index on tag makes "all TOEFL words" instant. The extra table costs ~60MB but saves query time on every tag-based lookup.

Why use sense_freq instead of sense_order?

An order is static — you'd need a migration to change which definition appears first. A frequency is dynamic — it can be reweighted based on user behavior, corpus statistics, or LLM re-ranking. "The occurrence of breaking" might be more common than "an unexpected piece of good luck," and the data should reflect that.

Why allow duplicate word values in the words table?

English has homographs — words with the same spelling but unrelated meanings. "Lie" (recline) and "lie" (untruth) are etymologically different words that happen to share a surface form. They need separate IDs and separate definitions. A UNIQUE constraint on word would force them into one entry, losing the distinction.

Why keep a flat words table instead of splitting into terms + entries?

The split (terms → entry_terms → entries) is semantically cleaner but forces a JOIN on every lookup. Under high concurrency on a server, even millisecond-level JOIN overhead compounds. The flat words table means the hot path is two simple indexed queries with zero JOINs.

How does the design handle words like "bound" that are both standalone and related?

"bound" has its own row in words and its own definitions (standalone). It also has a row in word_relationships pointing to "bind" (inflected_form). The two facts coexist without conflict. The hot path only reads words + definitions; the relationship is fetched optionally on the cold path.

Why put phonetics in a separate table?

Pronunciation varies by accent, POS, register, and generation — not just one dimension. "either" has two pronunciations as the same POS. "present" differs by POS. Tying phonetic to a single column on words or definitions can't capture this. A separate table with a note column handles all cases: accent ("British"), POS ("verb form"), preference ("common variant"), etc.

Why use columns for translations instead of a separate table?

A separate sense_translations table requires a JOIN for every definition lookup. With translations as columns on definitions, you get the complete sense (English + Chinese) in a single query. Adding a new language (Japanese, Korean) is just an ALTER TABLE — add a column. At 3M definition rows, the extra NULL columns cost negligible space.

What does review_status and source_status track?

The LLM pipeline generates definitions in stages. source_status tracks origin: ecdict_seeded (migrated from ECDICT), llm_generated (created by LLM), human_edited (reviewed and modified). review_status tracks quality: unreviewed → reviewed or rejected. This lets you ship the dictionary incrementally — ECDICT data first, LLM data as it's validated.

How does word_forms differ from word_relationships?

word_forms is an inventory of inflected forms — "go" has went, gone, going, goes. It's used for reverse lookups: user types "went" → find "go." word_relationships tracks the semantic link between two words: "went is the past tense of go." They serve different purposes: one is a lookup index, the other is a relationship declaration.

Why does each word have at most one reference entry?

A morphological form always traces back to exactly one origin. "went" is the past tense of "go" — not of any other word. "interesting" derives from "interest" — not from multiple sources. Even "bound," which is both standalone AND related to "bind," has exactly one morphological parent for that particular relationship.

The direction matters: a surface string can map to multiple dictionary entries (homographs — "lie" has two unrelated entries). But a dictionary entry has at most one morphological parent. This asymmetry is why word_relationships uses two foreign keys (word_id and related_word_id) rather than a many-to-many junction — each word points to at most one origin, not many.

What is the estimated database size?

Table	Est. Rows	Size
words	~780K	~70 MB
definitions	~3M	~150 MB
word_tags	~2M	~60 MB
phonetics	~1.5M	~50 MB
word_relationships	~200K	~10 MB
word_forms	~100K	~10 MB
Total (with indexes)	~7.5M	~360 MB

Fine for a bundled SQLite DB. Can ship compressed and decompress on first launch.

7 Full DDL

CREATE TABLE words (
    id          INTEGER PRIMARY KEY,
    word        TEXT    NOT NULL,
    collins     INTEGER,
    oxford      INTEGER,
    bnc         INTEGER,
    frq         INTEGER,
    audio_path  TEXT
);

CREATE INDEX idx_words_word    ON words(word);
CREATE INDEX idx_words_collins ON words(collins);
CREATE INDEX idx_words_bnc     ON words(bnc);
CREATE INDEX idx_words_frq     ON words(frq);

CREATE TABLE word_relationships (
    id              INTEGER PRIMARY KEY,
    word_id         INTEGER NOT NULL,
    related_word_id INTEGER NOT NULL,
    relation_type   TEXT    NOT NULL,
    form_type       TEXT,
    FOREIGN KEY (word_id)         REFERENCES words(id),
    FOREIGN KEY (related_word_id) REFERENCES words(id),
    UNIQUE (word_id, related_word_id, relation_type)
);

CREATE INDEX idx_wr_word ON word_relationships(word_id);
CREATE INDEX idx_wr_rel  ON word_relationships(related_word_id);

CREATE TABLE word_tags (
    id      INTEGER PRIMARY KEY,
    word_id INTEGER NOT NULL,
    tag     TEXT    NOT NULL,
    FOREIGN KEY (word_id) REFERENCES words(id),
    UNIQUE (word_id, tag)
);

CREATE INDEX idx_wt_tag ON word_tags(tag);

CREATE TABLE definitions (
    id              INTEGER PRIMARY KEY,
    word_id         INTEGER NOT NULL,
    sense_freq      INTEGER NOT NULL,
    pos             TEXT,
    definition      TEXT    NOT NULL,
    translate_cn    TEXT,
    usage_register  TEXT,
    note            TEXT,
    review_status   TEXT,
    source_status   TEXT,
    FOREIGN KEY (word_id) REFERENCES words(id)
);

CREATE INDEX idx_def_word ON definitions(word_id);

CREATE TABLE phonetics (
    id        INTEGER PRIMARY KEY,
    word_id   INTEGER NOT NULL,
    accent    TEXT,
    phonetic  TEXT    NOT NULL,
    note      TEXT,
    FOREIGN KEY (word_id) REFERENCES words(id)
);

CREATE INDEX idx_phon_word ON phonetics(word_id);

CREATE TABLE word_forms (
    id          INTEGER PRIMARY KEY,
    word_id     INTEGER NOT NULL,
    form_type   TEXT    NOT NULL,
    form_value  TEXT    NOT NULL,
    FOREIGN KEY (word_id) REFERENCES words(id)
);

CREATE INDEX idx_wf_word  ON word_forms(word_id);
CREATE INDEX idx_wf_value ON word_forms(form_value);