# Population Mapping v1 — `source_sample` → canonical tables

For each of the 11 surveyed sources, this document specifies which canonical-schema columns get populated from which payload fields in `source_sample.raw_payload`. **This is the spec for Phase 2 ingestion scripts.**

Notation: `payload.foo` means `source_sample.raw_payload->foo` in SQL terms. `extracted.foo` means the `extracted` sub-object produced by the discovery script.

---

## Per-source `source` row values (Phase 2 setup)

These rows must exist in `source` before any object/inscription ingestion runs.

| source.id (existing) | name | unit_code | stack_class | inscription_modality | richness | structural | id_tier |
|---|---|---|---|---|---|---|---|
| 1 | Thesaurus Linguae Aegyptiae (TLA) | `TLA` | `sru-fcs` | `mdc` | 10 | 8 | `tier-1` |
| 3 | Trismegistos (TM) | `TM` | `proprietary-cms` | `none` | 0 | 7 | **`primary`** (the cross-source spine) |
| 6 | MET — Egyptian Art (Open Access) | `MET-EgyArt` | `proprietary-cms` | `none` | 0 | 8 | `tier-2` |
| 11 | British Museum (BM) — via Wikidata | `BM` | `proprietary-cms` (CF-gated) | `none` | 0 | 3 | `tier-2` |
| 15 | Brooklyn Museum — via Wikidata | `BKM` | `proprietary-cms` (Vercel-gated) | `none` | 0 | 3 | `tier-2` |
| 53 | Fitzwilliam Museum — Antiquities | `FITZ-AntE` | `axiell-ciim` | `mdc` | 9 | 9 | `tier-2` |
| 57 | Leiden RMO | `LEIDEN` | `wordpress-cf-gated` | (blocked) | 0 | 0 | `tier-2` |
| 60 | ISAC Chicago IDB | `ISAC-MC` | `ke-emu-solr` | `english_prose` | 7 | 10 | `tier-2` |
| 64 | Manchester Museum | `MANCHESTER` | `wordpress-cf-gated` | (untestable) | 0 | 0 | `tier-2` |
| 79 | Smithsonian Open Access | `SI` (parent) | `edan-solr` | `none` | 2 | 8 | `tier-2` |
| 81 | Tell el-Amarna Project | `AMARNA-EES` | `wordpress-excel` | `sign_name_vocab` | 6 | 7 | `tier-3` |

For Smithsonian, Phase 2 should also create N child source rows (one per unit observed: `NMNHANTHRO`, `SIL`, `SAAM`, `NMAH`) with `parent_source_id = 79`, and object rows then point to the **child** source.

---

## #1 — TLA (Thesaurus Linguae Aegyptiae)

Payload format: `sru-xml/record` (per-record SRU/LexFCS XML) and `sru-xml/explain` (one meta row).

| canonical table | column | source path |
|---|---|---|
| `object` | (none — TLA records are *lemmas*, not artifacts) | TLA populates `inscription` directly, with `object_id` pointing to whichever object row is cross-linked elsewhere |
| `inscription` | `transcription_mdc` | parse XML: `<lex:Field type="lemma"><lex:Value xml:lang="egy-Latn-leidentr" preferred="true">` |
| `inscription` | `script` | `"Egyptian Hieroglyphs"` (or per `InsScript` mapping; TLA covers hieratic+demotic too) |
| `inscription` | `language` | `"Egyptian"` |
| `inscription` | `dialect` | (when present) |
| `inscription` | `source_native_inscription_id` | `extracted.entry_id` (the TLA-WCN, e.g., `90260`) |
| `translation` | `language_code` + `text` | parse `<lex:Field type="translation"><lex:Value xml:lang="...">` — TLA emits de/en/fr in parallel; one `translation` row each |
| `translation` | `translation_method` | `"scholarly"` (TLA is curated) |
| `cross_reference` | `target_id_type="TLA-WCN"`, `target_id_value=entry_id`, `target_id_tier="tier-1"` | for any object linked to this lemma |
| `canonical_id` | (already populated in Phase 1) | TLA-WCN regex `^(dm|d)?\d+$`, HDL regex |

**Ingest note:** TLA inscriptions can attach to an existing object via the future cross-source linkage pipeline (transliteration-match on `transcription_mdc`). For Phase 2 initial load, store as inscription rows with `object_id` pointing to a placeholder "Unattached TLA lemma" object per source, and let the linkage pipeline reassign.

---

## #2 — Trismegistos (TM)

Payload formats: `rdf-xml/tm-place-basic`, `json/tm-texrelations`, `rdf-xml/tm-person`.

| canonical table | column | source path |
|---|---|---|
| `excavation_context` (when place RDF mentions a known site) | `site` | parse RDF for `skos:altlabel xml:lang="en"` |
| `excavation_context` | `pleiades_id` | parse `skos:closeMatch rdf:resource="https://pleiades.stoa.org/places/..."` |
| `excavation_context` | `location_point` | parse `osgeo:asGeoJSON` (lon, lat) → PostGIS POINT |
| `cross_reference` | `target_id_type="TM-Text"`, `target_id_value=text_id`, `target_id_tier="primary"`, `match_method="explicit-source-field"` | populated when an object/inscription has an explicit TM-Text cross-ref (from any source; TM itself is the catalog) |
| `cross_reference` | `target_id_type="TM-Geo"`, target `target_id_value=geo_id`, `target_id_tier="primary"` | when geo cross-ref |
| `canonical_id` | (already populated Phase 1) | TM-Text, TM-Geo, TM-Person ID schemes |

**TM's role in Phase 2:** the cross-reference primary key authority. TM is rarely the OBJECT source; it's the spine.

---

## #3 — MET Open Access

Payload format: `json/met-object` per object, `json/met-dept-index` for the department snapshot.

| canonical table | column | source path |
|---|---|---|
| `object` | `source_native_id` | `extracted.object_id` (e.g., `550813`) |
| `object` | `accession_number` | `extracted.accession` (e.g., `31.3.110`) |
| `object` | `title`, `object_name`, `description` | `extracted.title`, `extracted.object_name`, body `medium`/prose |
| `object` | `form_category` | derive from search category (`stela`, `papyrus`, etc.) |
| `object` | `period_name`, `dynasty`, `reign` | `extracted.period`, `extracted.dynasty`, `extracted.reign` |
| `object` | `start_year_bce`, `end_year_bce` | `extracted.object_begin_date`, `extracted.object_end_date` (already signed ints) |
| `object` | `find_country`, `find_region`, `find_site`, `find_locus` | `extracted.country`, `extracted.region`, `extracted.subregion`, `extracted.locale` |
| `object` | `current_repository`, `current_gallery` | `extracted.repository`, `extracted.gallery` |
| `object` | `is_public_domain`, `metadata_license`, `rights_statement` | `extracted.is_public_domain`, `extracted.rights` |
| `object` | `acquisition_year` | `extracted.accession_year` |
| `object` | `source_url` | `extracted.object_url` |
| `image` | `url_full`, `url_large`, `url_thumbnail` | `extracted.primary_image`, `primary_image_small`, plus `additional_images[]` (one row each) |
| `image` | `is_public_domain` | inherits from `object.is_public_domain` |
| `cross_reference` | `target_id_type="Wikidata-Q"`, `target_id_tier="tier-1"`, `match_method="explicit-source-field"` | `extracted.object_wikidata` URL → extract `Q\d+` |
| `inscription` | (NONE — MET doesn't model inscriptions) |  |

**Note:** the Phase 2.5 fuzzy-match pipeline will add `cross_reference` rows (`TM-Text`, with confidence < 1.0) by matching MET accession + find spot against TM records.

---

## #4 — British Museum (BM) — via Wikidata fallback

Payload format: `json/wikidata-entity` (Wikidata JSON-API entity records).

| canonical table | column | source path |
|---|---|---|
| `object` | `source_native_id` | `extracted.wikidata_q` (e.g., `Q48584` for Rosetta Stone) |
| `object` | `accession_number` | `extracted.bm_inv_primary` (e.g., `EA24`) |
| `object` | `title`, `description` | `extracted.label_en`, `extracted.description_en` |
| `object` | `form_category` | derive from `extracted.type_labels` (P31 instance-of values) |
| `object` | `materials` | `extracted.material_labels` (P186) |
| `object` | `height_cm`, `width_cm`, `depth_cm` | parse `extracted.height_mm` / 10 (Wikidata stores mm) |
| `object` | `find_country`, `find_site` | `extracted.discovery_place_label`, `extracted.country_origin_label` |
| `object` | `source_url` | `extracted.bm_url` (the BM permalink, even though gated) |
| `image` | `url_full` | `extracted.image_url` (Commons `Special:FilePath` URL) |
| `image` | `is_public_domain` | TRUE (Wikidata images are CC0/CC-BY on Commons) |
| `cross_reference` | `target_id_type="Wikidata-Q"` | every BM row from this source IS a Wikidata Q — store self-cross-ref for graph traversal |
| `cross_reference` | `target_id_type="TM-Text"`, `target_id_tier="primary"`, `match_method="explicit-source-field"` | from `extracted.tm_text_id` (Wikidata's P8532) — 2/57 had this in the survey, but those are gold |
| `cross_reference` | `target_id_type="TLA-Object"`, `target_id_tier="tier-1"` | from `extracted.tla_object_id` (Wikidata P12185) — 1/57 had this |

**Caveat:** Wikidata coverage of BM is ~0.02% of holdings. The ~13 Egyptian BM objects via this source feed primarily the cross-reference table as **seed pairs** for the fuzzy-match pipeline (decisions.md).

---

## #5 — Brooklyn Museum (BKM) — via Wikidata fallback

Payload format: `json/wikidata-entity` (same as BM).

| canonical table | column | source path |
|---|---|---|
| Same as BM, but `accession_number` from `extracted.bkm_artwork_id` (Wikidata P4740) | | BKM had 0 TM/TLA cross-refs in our 4-record sample. |

**Coverage:** ~4 Egyptian records. Sparse but stored for completeness.

---

## #6 — Fitzwilliam Museum (Antiquities)

Payload format: `json/fitzwilliam-object`.

| canonical table | column | source path |
|---|---|---|
| `object` | `source_native_id` | `extracted.object_id` |
| `object` | `accession_number` | `extracted.accession` (`E.6.1909`) |
| `object` | `persistent_id` | `extracted.api_uri` (Fitzwilliam stable URI) |
| `object` | `title`, `object_name`, `description` | `extracted.summary_title`, `extracted.names[0]`, body description |
| `object` | `form_category` | from search category + cross-check `extracted.categories` |
| `object` | `materials`, `techniques` | `extracted.materials`, `extracted.techniques` |
| `object` | `height_cm`, `width_cm`, `depth_cm` | `extracted.measurements[]` filter by `dimension` field |
| `object` | `period_name`, `dynasty` | `extracted.creation_periods[].label` (collapse to most-specific) |
| `object` | `find_country`, `find_site`, `find_locus` | `extracted.find_places[]` (denormalised from Fitzwilliam's place hierarchy) |
| `object` | **`find_point`** | `ST_MakePoint(extracted.find_places[0].lng, .lat)` — Fitzwilliam is the primary source of structured lat/lng |
| `object` | `metadata_license` | `"CC-BY-NC-ND 4.0"` (Fitzwilliam API ToS) |
| `inscription` | one row per entry in `extracted.inscriptions[]` |  |
| `inscription` | `sequence_number` | array index + 1 |
| `inscription` | `location_on_object` | `extracted.inscriptions[i].location` |
| `inscription` | `method` | `extracted.inscriptions[i].method` |
| `inscription` | `inscription_type` | `extracted.inscriptions[i].type` |
| `inscription` | **`transcription_mdc`** | `extracted.inscriptions[i].transcription` (MdC string) |
| `inscription` | `description_en` | `extracted.inscriptions[i].description` (often a visual description, not a translation) |
| `image` | three rows per `extracted.images[i]` — one per resolution (`preview`, `large`, `original`) | actually one image row with all three URL columns populated |
| `object_publication` | row per `extracted.publications[]` | populate `publication` table first if citation is new |

---

## #7 — Leiden RMO

Payload format: `access-probe` only (10 probe rows; no collection data).

| canonical table | column | source path |
|---|---|---|
| (none — no collection data reachable) | | The `source` row exists with `access_method='blocked'`. Phase 2 ingest skips Leiden until Playwright workstream unblocks it. |

---

## #8 — ISAC Chicago IDB

Payload formats: `json/isac-solr-mc` (Museum Collection records) and `json/isac-solr-ra` (Research Archive bibliographic records).

| canonical table | column | source path |
|---|---|---|
| `object` (from MC dataset) | `source_native_id` | `extracted.irn` (e.g., `901040`) |
| `object` | `accession_number` | `extracted.registration` (e.g., `E10486A-P`) |
| `object` | `persistent_id` | `extracted.guid` (UUID) |
| `object` | `title`, `description` | `extracted.description` |
| `object` | `form_category` | from `extracted.broad_classification` (`Stele` / `Manuscript` / `Sculpture` / `Jewelry` / `Vessel` / `Architectural Element`) |
| `object` | `materials` | `extracted.materials` (array) |
| `object` | `period_name`, `dynasty`, `reign` | `extracted.period`, `extracted.dynasty`, `extracted.king_ruler` |
| `object` | `find_country`, `find_region`, `find_site`, `find_locus` | `extracted.country`, `region`, `site`, `locus` (controlled vocab; Pro* fields in raw) |
| `object` | `excavation_context_id` | when `extracted.locus` is set, find-or-create excavation_context row with `site = extracted.site`, `locus = extracted.locus` |
| `object` | `current_repository` | `"ISAC Chicago"` |
| `object` | `current_gallery`, `is_on_display` | `extracted.gallery`, `extracted.is_on_display` |
| `inscription` | one row per inscribed object (use `extracted.inscriptions[]` from the extractor — sequence-numbered from `InscMarkDescription0/1/...`) |  |
| `inscription` | `location_on_object` | per-entry `location` |
| `inscription` | `script` | `extracted.inscription_script` |
| `inscription` | `language` | `extracted.inscription_language` |
| `inscription` | `dialect` | `extracted.inscription_dialect` |
| `inscription` | `inscription_type` | `extracted.inscription_type` |
| `inscription` | **`description_en`** | per-entry `description` (English content) |
| `publication` (from RA dataset) | one row per RA record | `extracted.title`, `extracted.author`, `extracted.date`, `extracted.url` (JSTOR), `extracted.is_online` |
| `publication` | `archive_type` | derive from URL: `jstor.org` → `"jstor"`, `hathitrust.org` → `"hathitrust"` |
| `object_publication` | when MC object has citations in RA records — link via fuzzy match on (Site + ColClassification + AccessionNumber) | (Phase 2.5 work) |

---

## #9 — Manchester Museum

Payload formats: `access-probe` + `json/europeana-record`.

| canonical table | column | source path |
|---|---|---|
| `object` (sparse) | from Europeana records only | 30 records sampled, all musical instruments (not Egyptian); the source row exists but Phase 2 may skip ingest entirely |
| | | Awaits Playwright unblock for real Manchester data |

---

## #10 — Smithsonian Open Access

Payload format: `json/smithsonian-edanmdm`.

Phase 2 should create **child source rows** per unit observed and store objects under the child. Parent row (Smithsonian, source.id=79) is the federation root; objects' `source_id` points to children.

| canonical table | column | source path |
|---|---|---|
| `source` (Phase 2 setup) | create one per unit observed | `unit_code` from `extracted.unit_code`; `parent_source_id` = 79; `data_source` from `extracted.data_source` |
| `object` | `source_id` | the CHILD source.id (not the parent) |
| `object` | `source_native_id` | `extracted.smithsonian_id` (e.g., `ld1-1643407100205...`) |
| `object` | `accession_number` | from `extracted.identifiers[]` where `label='Accession Number'` |
| `object` | `persistent_id` | `extracted.guid_ark` (e.g., `http://n2t.net/ark:/65665/...`) |
| `object` | `title`, `description` | `extracted.title`, prose from `extracted.notes` |
| `object` | `form_category` | from search category + cross-check `extracted.object_type` (controlled-vocab values like `Mummies`, `Figurines`) |
| `object` | `materials` | `extracted.physical_description` (parse from prose — Smithsonian lacks structured material) |
| `object` | `period_name` | derive from `extracted.indexed_date` (decade-binned) and culture |
| `object` | `find_country`, `find_region` | `extracted.geo_hierarchy[0].L2.content`, `L3.content` |
| `object` | `find_point` | (none — Smithsonian provides hierarchical place, not coords) |
| `object` | `current_repository` | "Smithsonian Institution — {data_source}" |
| `object` | `metadata_license` | `extracted.metadata_usage` (always `"CC0"`) |
| `object` | `is_public_domain` | TRUE (CC0 universal) |
| `image` | one row per non-null `extracted.image_url`-ish field | (Smithsonian record has `has_images` boolean; URL extraction needs additional payload parsing — Phase 2 work) |
| `cross_reference` | `target_id_type="Wikidata-Q"`, `target_id_tier="tier-1"` | when present (rare per the survey — 0 Egyptian) |

---

## #11 — Tell el-Amarna Project

Payload format: `xls/amarna-object-database` (one row per Excel record).

| canonical table | column | source path |
|---|---|---|
| `excavation_context` | one row per (site, area, grid, locus) combination | find-or-create with `site="Tell el-Amarna"`, `area_subarea = extracted.area_suburb`, `grid_ref = extracted.subdivision`, `locus` derived from `extracted.further_details` when feature-named, `context_description = extracted.stratigraphic_details` |
| `excavation_context` | `excavation_year` | `extracted.year_of_excavation` |
| `excavation_context` | `excavation_project` | `"EES Amarna Project"` (when year < 1980) or `"McDonald Institute Amarna Project"` (when ≥ 1980) |
| `object` | `source_native_id` | `extracted.registration_number` (e.g., `21/8`) |
| `object` | `accession_number` | same as `source_native_id` (the registration IS the accession) |
| `object` | `title`, `description` | construct from `extracted.identification_keywords` + `notes_identification` |
| `object` | `object_name` | first colon-segment of `extracted.identification_keywords` |
| `object` | `form_category` | `extracted.form_category` (already mapped) |
| `object` | `materials` | parse `extracted.material` (may be slash-separated: `"copper/bronze"`) |
| `object` | `height_cm`, `width_cm`, `depth_cm`, `diameter_cm`, `weight_g` | `extracted.measurements.height` / `length` / etc. (already cm/g per the Excel) |
| `object` | `period_name`, `dynasty` | default `"New Kingdom"`, `"Dynasty 18"` unless `extracted.date_alt` overrides |
| `object` | **`excavation_context_id`** | from find-or-create above |
| `object` | `find_country`, `find_site` | `"Egypt"`, `"Tell el-Amarna"` |
| `object` | **`petrie_grid`** | `extracted.subdivision` (e.g., `O49.22`) |
| `inscription` | one row when `extracted.has_inscription_content == TRUE` | sequence_number = 1 |
| `inscription` | `location_on_object` | `"obverse"` for Amarna's structure |
| `inscription` | **`decoration_signs`** | parse `extracted.decoration_obverse` + `decoration_reverse` into `[{"category": "...", "name": "..."}, ...]` — split on colon |
| `inscription` | `script` | infer `"Egyptian Hieroglyphs"` when decoration mentions hieroglyph categories |
| `inscription` | `description_en` | populate with the raw `decoration_obverse` text as backup |
| `publication` | one row per (publication_refs entry); deduplicate by canonical citation | citation form: `"Pendlebury, City of Akhenaten III"`; `series="EES Excavation Memoirs"`; `archive_type` lookup against HathiTrust/Internet Archive in Phase 2.5 |
| `object_publication` | one row per COA reference per object | `reference_detail = extracted.publication_refs[i][1]` (the "p. 29, pl. XIII.5" string) |

**Note:** Amarna has NO direct TM/Wikidata/Pleiades cross-refs. The cross-reference table gets populated for Amarna via the fuzzy-match pipeline (Phase 2.5) using:
- registration number + year + identification_keywords → text-database matches
- decoration_signs (named hieroglyphs / cartouches) → TM person-records for kings

---

## Summary table (which canonical tables each source populates)

| source | object | inscription | excavation_context | translation | cross_reference | image | publication |
|---|---|---|---|---|---|---|---|
| TLA | — | ✓ (mdc) | — | ✓ (multi-lang) | ✓ (TLA-WCN) | — | — |
| TM | — | — | ✓ (places) | — | ✓ (TM spine) | — | — |
| MET | ✓ | — | — | — | ✓ (Wikidata) | ✓ | — |
| BM (via WD) | ✓ (sparse) | — | — | — | ✓ (TM seeds) | ✓ | — |
| BKM (via WD) | ✓ (sparse) | — | — | — | ✓ | ✓ | — |
| Fitzwilliam | ✓ | ✓ (mdc) | — | — | — | ✓ (3-res) | ✓ |
| Leiden | — | — | — | — | — | — | — |
| ISAC | ✓ | ✓ (english) | ✓ (locus) | — | — | — | ✓ (RA) |
| Manchester | (deferred) | — | — | — | — | — | — |
| Smithsonian | ✓ | — | — | — | ✓ (when present) | ✓ | — |
| Tell el-Amarna | ✓ | ✓ (sign_name) | ✓ (Petrie grid) | — | — | — | ✓ (COA) |

**Total: 11 source rows; ~9 sources contribute to `object`; 4 contribute to `inscription` (with 3 different content modalities); 3 contribute to `excavation_context`; 3 contribute to `publication`.**

---

## Phase 2 ingestion sequence

1. **Reset/recreate source rows** with `unit_code`, `stack_class`, `inscription_richness_score`, `structural_cleanliness_score`, `inscription_content_modality`, `id_tier_in_corpus`. For Smithsonian, also create per-unit child rows.
2. **Ingest excavation contexts** from Amarna + ISAC. These are referenced by objects so must exist first.
3. **Ingest objects** in source order: MET, Fitzwilliam, ISAC, Smithsonian (via children), Tell el-Amarna, BM via WD, BKM via WD.
4. **Ingest inscriptions** from Fitzwilliam (MdC) + ISAC (English) + Tell el-Amarna (sign-vocab) + TLA (attach to existing or placeholder objects).
5. **Ingest translations** from TLA (de/en/fr per lemma).
6. **Ingest images** from MET, Fitzwilliam, Smithsonian, BM/BKM via Commons.
7. **Ingest publications + junctions** from Fitzwilliam, ISAC RA, Tell el-Amarna COA.
8. **Generate explicit-source cross-refs** from Wikidata-mediated tier-1 pairs (BM Rosetta → TM 8809, etc.).
9. **Compute and store inscription embeddings** (1536-dim) using OpenAI text-embedding-3-small or local equivalent.
10. **Drop and rebuild IVFFlat index** with `lists = sqrt(N)` once population complete.
11. **(Phase 2.5)** Run fuzzy-match pipeline to populate <1.0-confidence cross-refs.