# Egyptian Inscriptions Database — Canonical Schema v1

**Status:** schema applied to Postgres on `publiclandsvisitors` as of 2026-05-24. No data populated yet — Phase 2 ingestion scripts will fill from `source_sample`.

**Provenance:** every non-trivial column comes from a specific finding in one of the 11 source surveys. Source citations are inline (see `decisions.md` and `sources/discover_*.py` for the full trail).

**Migration:** `sql/003_canonical_schema_v1.sql`. Idempotent (CREATE TABLE IF NOT EXISTS, ADD COLUMN IF NOT EXISTS, etc.). Requires `pgvector` and `postgis` extensions.

---

## Design philosophy

1. **One table per first-class concept.** `object`, `inscription`, `excavation_context`, `publication` are all real entities with identity, not denormalised attributes.
2. **The `object ↔ inscription` join is 1:N.** Confirmed by Fitzwilliam (Stela E.6.1909 has 7 inscriptions), ISAC (numbered `InscMarkDescription0/1/2...`), Tell el-Amarna (obverse + reverse). One object, many discrete inscriptions.
3. **Three inscription content modalities coexist.** Sources populate different ones — Fitzwilliam emits MdC transliteration, ISAC emits English description prose, Tell el-Amarna emits structured sign-name vocabulary. `inscription` has all three columns; any or all may be NULL.
4. **Federation over consolidation.** `source` carries Smithsonian-template fields (`unit_code`, `data_source`, `persistent_id_scheme`, `parent_source_id`) so multi-unit sources stay queryable as one or as N depending on the question.
5. **Cross-source IDs are tiered.** TM IDs are primary (the spine, per the Trismegistos survey). Wikidata QIDs are tier-1. Museum accessions are tier-2. Local registration numbers (Amarna `21/8`, Fitzwilliam `E.6.1909`) are tier-3.
6. **Stratigraphic detail goes in `excavation_context`, not on `object`.** Many objects can share one context (a hoard from one tomb). Museum-sourced objects leave `object.excavation_context_id` NULL; their prose find spots live denormalised on `object.find_*` columns.
7. **Vector search on inscriptions.** A `VECTOR(1536)` column on `inscription` lets us find semantically-similar inscriptions even across modalities (embed `transcription_mdc` and `description_en` into the same vector space at ingest).
8. **PostGIS geographic points on objects and contexts.** Direct lat/lng (when sources provide them, primarily Fitzwilliam + future Pleiades cross-ref) plus optional `find_pleiades_id` for normalised gazetteer lookup.

---

## Tables

### `source` (Phase 1 table — extended in v1)

Catalogs every external data source. Phase 1 columns retained; Phase 2 v1 adds federation and quality-score fields.

| column | type | reasoning |
|---|---|---|
| `id` | BIGSERIAL PK | |
| `name` | TEXT UNIQUE | human-readable source name |
| `url` | TEXT | homepage |
| `contact` | TEXT | contact URL / email |
| `type` | TEXT | `museum_collection` / `text_db` / `excavation_archive` / `federation_aggregator` / `bibliography` |
| `license` | TEXT | metadata license |
| `access_method` | TEXT | `rest_api` / `rest_api_with_key` / `bulk_dump` / `blocked` / `unreachable` |
| `status` | TEXT | `pending` / `surveying` / `surveyed` / `blocked` / `probed` |
| `last_surveyed_at` | TIMESTAMPTZ | |
| `notes` | TEXT | per-source survey notes |
| **`unit_code`** | TEXT UNIQUE WHERE NOT NULL | short slug for federation (e.g., `TLA`, `MET-EgyArt`, `ISAC-MC`, `AMARNA-EES`). Adopted from Smithsonian template (decisions.md 2026-05-24). |
| **`parent_source_id`** | BIGINT FK→source | hierarchical federation. Smithsonian = one parent row + N unit rows; objects' `source_id` points to the leaf unit. |
| **`data_source`** | TEXT | finer-grained display name (e.g., `"NMNH - Anthropology Dept."` for unit `NMNHANTHRO`). |
| **`persistent_id_scheme`** | TEXT | `"ARK"` / `"DOI"` / `"Handle"` / `"TLA-WCN"` / `"TM-ID"` — what kind of stable ID this source mints. |
| **`stack_class`** | TEXT | `"proprietary-cms"` / `"axiell-ciim"` / `"ke-emu-solr"` / `"edan-solr"` / `"wordpress-excel"` / `"wordpress-cf-gated"` / `"sru-fcs"`. Six stack classes observed across surveys. |
| **`inscription_richness_score`** | INT 0-10 | 0 = no inscription content; 10 = full MdC transliteration. Two-axis quality scoring per the brief. |
| **`structural_cleanliness_score`** | INT 0-10 | 0 = HTML scrape only; 10 = fully structured Solr/JSON. |
| **`inscription_content_modality`** | TEXT | `mdc` / `english_prose` / `sign_name_vocab` / `none` / `mixed`. Which of the three modalities this source populates. |
| **`id_tier_in_corpus`** | TEXT | `primary` / `tier-1` / `tier-2` / `tier-3`. Where this source's IDs play in the cross-source spine. |
| `api_url`, `bulk_dump_url`, `record_url_template` | TEXT | concrete URLs for ingestion. |

### `source_sample` (Phase 1 — unchanged)

Raw API payloads from discovery surveys. Phase 2 ingestion reads from here.

### `canonical_id` (Phase 1 — extended)

Catalog of ID *schemes* (not values; values live in `cross_reference`).

| column | type | reasoning |
|---|---|---|
| `id_type` | TEXT | e.g., `"TM-Text"`, `"Wikidata-Q"`, `"MET-Object"` |
| `example`, `format_regex`, `notes` | TEXT | as before |
| **`id_tier`** | TEXT | `primary` / `tier-1` / `tier-2` / `tier-3` — the cross-source-ID-tier decision. TM-Text is `primary` (decisions.md). |
| **`url_template`** | TEXT | `"https://thesaurus-linguae-aegyptiae.de/lemma/{id}"` |
| **`match_strategy`** | TEXT | `"exact"` / `"fuzzy-accession"` / `"wikidata-mediated"` |

---

### `excavation_context` (new)

The first excavation-archive shape (Tell el-Amarna survey, source #11) needed a richer find-spot model than any museum source provides. Stored as a SEPARATE table so:
- Many objects can share one context (a hoard from Tomb 24)
- Contexts are queryable independently for hierarchy traversal
- Museum-sourced objects leave `object.excavation_context_id` NULL

| column | type | source-driven reasoning |
|---|---|---|
| `id` | BIGSERIAL PK | |
| **`site`** | TEXT NOT NULL | "Tell el-Amarna", "Thebes", "Medinet Habu" — always populated. |
| **`area_subarea`** | TEXT | Tell el-Amarna's controlled vocab: `"M.C."` (Main City), `"C.C."` (Central City), `"N.S."` (North Suburb), `"W.V."` (Workmen's Village), `"Maru Aten"`. Other sites bring their own. |
| **`grid_ref`** | TEXT | **Petrie grid** at Amarna (`"O49.22"`). Many Egyptian sites use this convention. Cross-source spatial join key. |
| **`locus`** | TEXT | Specific feature: `"Tomb of Anpy"`, `"Foundation Deposit"`, `"Tomb 24 Room 19"`. ISAC populates this from `ProLocus`. |
| **`context_description`** | TEXT | Stratigraphic prose: `"pit (?well) just to west of west enclosure wall of O.49.7&8"`. Tell el-Amarna's `Stratigraphic details` column. |
| `excavation_year`, `excavation_season`, `excavator`, `excavation_project` | TEXT | Season metadata. Tell el-Amarna populates year directly; other archives may use named seasons. |
| **`location_point`** | GEOGRAPHY(POINT, 4326) | PostGIS WGS84 lat/lon. Sourced from Pleiades or direct gazetteer lookup. |
| **`pleiades_id`** | TEXT | Pleiades gazetteer ID — bridges to the broader ancient-Mediterranean place network (per TM survey: `skos:closeMatch` to Pleiades is the cross-gazetteer spine). |
| `parent_context_id` | BIGINT FK | for tree-style hierarchy if a future source needs it (currently no source we've surveyed requires this). |

**Unique key:** `(site, area_subarea, grid_ref, locus)` via a unique index with COALESCE (Postgres doesn't allow function calls in UNIQUE constraints).

---

### `object` (new)

The core artifact row. Every museum object, every excavation-recovered artifact, every text-database-mentioned object lands here.

| column | type | source-driven reasoning |
|---|---|---|
| `id` | BIGSERIAL PK | |
| `source_id` | BIGINT NOT NULL FK→source | which source supplied this object record. |
| `source_native_id` | TEXT NOT NULL | the source's internal ID (e.g., MET `objectID`, Fitzwilliam `object-51004`, Amarna registration `21/8`). UNIQUE with `source_id`. |
| `source_payload_id` | BIGINT FK→source_sample | back-pointer to raw payload for re-extraction. |
| **`accession_number`** | TEXT | Museum accession (`E.6.1909`, `EA24`, `21/8`). Cross-museum join candidate per the cross-museum-fuzzy-match decision. |
| **`persistent_id`** | TEXT | ARK / DOI / Handle (e.g., `ark:/65665/3d783914d-...` from Smithsonian). Long-term stable ID when source mints one. |
| `title`, `object_name` | TEXT | display strings (`"Stela"`, `"Papyrus of Ani"`). |
| **`form_category`** | TEXT CHECK | Controlled to `stela / papyrus / statue / jewelry / vessel / architecture / other` — the six MET-aligned categories used across all surveys for direct comparability. |
| `description` | TEXT | curator/excavator prose. ISAC's `CatDescription`, MET's prose fields, etc. |
| **`materials`** | TEXT[] | Array because objects routinely have multiple materials (`{"limestone", "pigment"}`). GIN-indexed for fast filter. From Fitzwilliam `materials` field, ISAC `ProMaterials`, Tell el-Amarna `Material/s`. |
| **`techniques`** | TEXT[] | parallel structure to materials. Fitzwilliam exposes this directly. |
| `height_cm`, `width_cm`, `depth_cm`, `diameter_cm`, `weight_g` | NUMERIC | Unit-normalised dimensions. Fitzwilliam emits structured `measurements.dimensions[]`; Tell el-Amarna has 6 measurement columns; ISAC has none directly. |
| **`period_name`** | TEXT | "New Kingdom", "Ptolemaic". ISAC's `DatPeriod`; Fitzwilliam's periods (multi-valued — collapse to most-specific). |
| **`dynasty`** | TEXT | "Dynasty 18". ISAC's `DatDynasty`, Fitzwilliam dynasty period URIs. |
| **`reign`** | TEXT | "Akhenaten". ISAC's `DatKingRuler`. |
| `date_prose` | TEXT | Raw display date (`"ca. 945-715 B.C."`). |
| **`start_year_bce`, `end_year_bce`** | INT signed | Negative = BCE. MET's `objectBeginDate/objectEndDate` (signed ints, already this format), Fitzwilliam's from-to ranges. |
| **`excavation_context_id`** | BIGINT FK→excavation_context | NULL for museum-sourced; populated for excavation-sourced. |
| `find_country`, `find_region`, `find_site`, `find_locus` | TEXT | Denormalised find-spot for fast query. ISAC's structured Pro* fields, Fitzwilliam's `lifecycle.collection.places`. |
| **`find_point`** | GEOGRAPHY(POINT, 4326) | PostGIS. **Fitzwilliam is the only museum source providing direct lat/lng** (50/109 sampled objects). Smithsonian provides hierarchical place but no coords. |
| **`find_pleiades_id`** | TEXT | Pleiades cross-ref (per Trismegistos `skos:closeMatch` pattern). |
| **`petrie_grid`** | TEXT | "O49.22" — cross-source field for Petrie grid reference, populated for Amarna and any future Petrie-site source. |
| `current_repository`, `current_gallery`, `is_on_display` | | "Current location" axis, distinct from find-spot. MET, ISAC, Smithsonian all expose this. |
| `metadata_license`, `is_public_domain`, `rights_statement` | | Per-object licensing. MET has `isPublicDomain` boolean; Smithsonian has `metadata_usage.access = CC0` universally. |
| `acquisition_method`, `acquisition_date`, `acquisition_year` | | Provenance audit trail. |
| `source_url` | TEXT | direct URL to source's record page. |
| `last_modified_in_source` | TIMESTAMPTZ | MET's `metadataDate`. |

**Unique:** `(source_id, source_native_id)`.

**Indexes:** `form_category`, `period_name`, `dynasty`, `find_country`, `find_site`, `accession_number`, `persistent_id`, `excavation_context_id`, `petrie_grid`, year-range, GIST on `find_point`, GIN on `materials` + `techniques`.

---

### `inscription` (new — the heart of the project)

One row per discrete inscription on an object. 1:N with `object`.

| column | type | source-driven reasoning |
|---|---|---|
| `id` | BIGSERIAL PK | |
| `object_id` | BIGINT NOT NULL FK→object ON DELETE CASCADE | |
| `sequence_number` | INT NOT NULL DEFAULT 1 | ordering within an object. Fitzwilliam emits unordered list; ISAC numbers as `InscMarkDescription0/1/2...`. UNIQUE with object_id. |
| `source_id` | BIGINT FK→source | **may differ from object's source** — text-database sources (TLA) can be the inscription source even when the object's canonical record lives in a museum source. |
| `source_payload_id` | BIGINT FK→source_sample | raw payload back-pointer |
| `source_native_inscription_id` | TEXT | |
| **`location_on_object`** | TEXT | "left jambs", "obverse", "panel below offering scene". Fitzwilliam's `location`, ISAC's `InscMarkLocation`. |
| **`method`** | TEXT | "incised", "painted", "inked". Fitzwilliam's `method`. |
| **`inscription_type`** | TEXT | "Letter", "Funerary", "Votive", "Religious", "Administrative". ISAC's `InscMarkType`. |
| **`script`** | TEXT | "Egyptian Hieroglyphs", "Demotic", "Hieratic", "Coptic", "Greek", "Arabic". ISAC's `InsScript` (controlled vocab observed). |
| **`language`** | TEXT | "Egyptian", "Greek", "Latin", "Arabic". ISAC's `InscMarkLanguage`. |
| **`dialect`** | TEXT | "Middle Egyptian", "Late Egyptian", "Sahidic", "Koine". ISAC's `InsDialect`. |
| **`transcription_mdc`** | TEXT | **Fitzwilliam-style** Manuel de Codage transliteration: `"Htp di nsw in wsir nb Ddw t Hnqt xt nb(t) wabt..."`. Same form TLA emits. |
| **`description_en`** | TEXT | **ISAC-style** English description / translation prose: `"for the benefit of Yartiuerow"`. |
| **`decoration_signs`** | JSONB | **Tell el-Amarna-style** structured sign-name vocabulary: `[{"category": "cartouche", "name": "Akhenaten"}, {"category": "hieroglyph", "name": "nefer"}]`. GIN-indexed (jsonb_path_ops). |
| **`content_embedding`** | VECTOR(1536) | pgvector — for cross-modality semantic search. 1536 dims = OpenAI `text-embedding-3-small`. Index: IVFFlat cosine. |

**CHECK constraint:** at least one of `transcription_mdc`, `description_en`, `decoration_signs`, `location_on_object`, `script`, or `inscription_type` must be non-NULL. We accept "structural-only" inscription rows (a Stela that we know has an inscription but no content yet).

**Indexes:** `object_id`, `script`, `language`, `dialect`, `inscription_type`, GIN on `decoration_signs`, GIN tsvector full-text on `description_en` (English analyzer), GIN tsvector on `transcription_mdc` (simple analyzer — MdC isn't a natural language for FTS purposes), IVFFlat on `content_embedding`.

---

### `translation` (new)

Multiple translations per inscription, keyed by `(inscription_id, language_code, translator)`. TLA emits multi-language translations (de/en/fr); scholarly publications add more.

| column | type | reasoning |
|---|---|---|
| `language_code` | TEXT NOT NULL | ISO 639-1 / 639-3 (`"en"`, `"de"`, `"fr"`, `"ar"`) |
| `text` | TEXT NOT NULL | the translation |
| `translator` | TEXT | who produced it. Multiple translators per (inscription, language) get separate rows. |
| `translation_year` | INT | dates the translation to support competing-translation queries. |
| `citation` | TEXT | publication where the translation appears. |
| `translation_method` | TEXT | `"scholarly"` / `"machine"` / `"annotated"`. |
| `is_authoritative` | BOOLEAN | one canonical translation per language can be marked. |
| `source_id` | BIGINT FK→source | which source provided this translation. |

---

### `cross_reference` (new — the cross-source spine)

Actual cross-source ID VALUES attached to records. The catalog of *schemes* lives in `canonical_id`; the values live here.

| column | type | reasoning |
|---|---|---|
| `subject_table` | TEXT CHECK | `"object"` / `"inscription"` / `"excavation_context"`. Polymorphic subject. |
| `subject_id` | BIGINT | |
| `target_id_type` | TEXT | "TM-Text", "Wikidata-Q", "MET-Object", "Pleiades-Place", "TLA-WCN" |
| `target_id_value` | TEXT | the actual ID value |
| **`target_id_tier`** | TEXT CHECK | `primary` / `tier-1` / `tier-2` / `tier-3`. Lets queries select cross-refs at desired specificity. |
| **`confidence`** | NUMERIC 0-1 | 1.0 = explicit source field; <1.0 = fuzzy-matched. The future cross-museum→TM fuzzy pipeline will populate <1.0 values. |
| **`match_method`** | TEXT | `"explicit-source-field"` / `"fuzzy-accession"` / `"wikidata-mediated"` / `"transliteration-match"`. Audit trail for fuzzy refs. |
| `evidence` | TEXT | what data supported the match (so we can re-evaluate if upstream data changes). |
| `source_id` | BIGINT FK→source | which source produced this cross-ref. |

**Unique:** `(subject_table, subject_id, target_id_type, target_id_value)`.

---

### `image` (new)

Object-level OR inscription-level images. The same object's images may include some that show specific inscriptions; FK to `inscription_id` is nullable.

| column | type | reasoning |
|---|---|---|
| `object_id` | BIGINT FK→object | nullable; CHECK at least one of the two FKs is set |
| `inscription_id` | BIGINT FK→inscription | for images that specifically show one inscription |
| `url_full`, `url_large`, `url_thumbnail`, `url_iiif_manifest` | TEXT | multiple resolutions + IIIF Presentation API URL when source exposes one (none of the surveyed sources do consistently, but the column is here for future) |
| `caption`, `image_type`, `rights`, `is_public_domain`, `width_px`, `height_px` | | per-image metadata |
| `sha256_hash` | TEXT | for dedup across sources (same image reused by MET, Smithsonian, Wikipedia all pointing at one canonical file) |

---

### `publication` + `object_publication` + `inscription_publication` (new — print-OCR bridge)

ISAC's `RA.BibUrl` → JSTOR pattern and Tell el-Amarna's `COA I/II/III` references demand structured publication linkage. M:M between objects and publications because one publication references many objects; one object can be in multiple publications.

`publication`:
- canonical citation string (UNIQUE)
- author, title, year, publisher, volume, issue, series (`"EES Excavation Memoirs"`)
- `publication_type`: `"book"` / `"article"` / `"memoir"` / `"catalog"` / `"memoir-series"`
- `url`, `doi`, `archive_type` (`"jstor"` / `"hathitrust"` / `"internet-archive"` / `"ees"`), `is_online`

Junction tables (`object_publication`, `inscription_publication`) carry `reference_detail` like `"p. 29, pl. XIII.5"` so the citation+page-locator pair is preserved.

---

## Key decisions reflected (checklist from the brief)

| decision | column / table | evidence in source surveys |
|---|---|---|
| **Object and inscription as separate tables (1:N)** | `object` + `inscription.object_id` | Fitzwilliam stela has 7 inscriptions; ISAC's `InscMarkDescription0/1/2...` numbering. |
| **Three inscription content modalities** | `inscription.transcription_mdc` / `description_en` / `decoration_signs` | Fitzwilliam MdC + ISAC English prose + Tell el-Amarna sign-name vocab. |
| **Conditional stratigraphic columns** | `object.excavation_context_id` FK (nullable) + `object.find_*` denormalised | Tell el-Amarna's structured Provenance fields drive the `excavation_context` table; museum sources leave the FK NULL. |
| **Source federation fields (Smithsonian template)** | `source.unit_code` + `parent_source_id` + `data_source` + `persistent_id_scheme` | Smithsonian's 48-unit federation + ARK persistent IDs. |
| **Two-axis quality scores** | `source.inscription_richness_score` + `structural_cleanliness_score` | Two-axis decision per the brief. |
| **Canonical ID tiers** | `canonical_id.id_tier` + `cross_reference.target_id_tier` | TM as primary (decisions.md), Wikidata tier-1, museum accessions tier-2. |
| **Multiple translations per inscription** | `translation` table (N:1 with inscription) | TLA emits de/en/fr in parallel for each lemma. |
| **Structured period: name + dynasty + start/end years** | `object.period_name` + `dynasty` + `reign` + `start_year_bce` + `end_year_bce` | ISAC has all four; Fitzwilliam has period URIs; MET has signed-int year ranges. |
| **Geography: Pleiades ID + PostGIS point + Petrie grid** | `object.find_pleiades_id` + `find_point` + `petrie_grid` + `excavation_context.{pleiades_id, location_point, grid_ref}` | Fitzwilliam coords (50/109), Trismegistos Pleiades cross-refs, Amarna Petrie grid. |
| **pgvector embedding column on inscription** | `inscription.content_embedding VECTOR(1536)` | Cross-modality semantic search (Fitzwilliam MdC ↔ ISAC English ↔ Amarna sign-vocab all into one vector space). |

---

## Things this schema deliberately does NOT include

- **A `person` / `agent` table.** Egyptian inscriptions reference deities, kings, named individuals — but our current surveys don't expose person-level cross-refs cleanly enough to model. TM has TM-Person IDs; we capture those as `cross_reference` rows for now. Revisit when we have N≥2 person-aware sources.
- **A separate `material` controlled-vocabulary table.** Materials are a TEXT[] on `object` with GIN indexing. If we need formal hierarchical materials (e.g., "faience" ⊂ "ceramic"), we add a `material_vocab` table later.
- **Versioned records.** We don't track edit history. Source sample payloads ARE the version history — re-ingest replaces canonical rows.
- **A `user` / `audit_log` table.** Phase 1 is a research database, not a multi-user platform. Add when needed.
- **Full-text materialised view across object + inscription + translation.** Each table is FTS-indexed independently; cross-table search lives in `test_queries_v1.sql` as UNION queries.

---

## Operational notes

- **Apply migration as Postgres superuser** (CREATE EXTENSION needs it):
  `sudo -u postgres psql -d egyptian_inscriptions -f sql/003_canonical_schema_v1.sql`
- **The IVFFlat index emits a warning** with no data: "ivfflat index created with little data". Drop and recreate after population (`DROP INDEX inscription_content_embedding_ivfflat; CREATE INDEX ... WITH (lists = sqrt(N));`).
- **Tables are owned by `egyptdb`** so app code can use them; PostGIS's `spatial_ref_sys` stays owned by `postgres`.

---

## Migration history

| migration | applied | what |
|---|---|---|
| `001_initial_schema.sql` | 2026-05-23 | Phase 1 discovery tables: `source`, `source_sample`, `canonical_id` |
| `002_add_probed_status.sql` | 2026-05-23 | Add `'probed'` to `source.status` CHECK |
| `003_canonical_schema_v1.sql` | 2026-05-24 | Phase 2 canonical tables (this document) |