ETL Stages: Corrected and Confirmed

1. Extract (not "Load")

Generic, config-driven

What it does: - Read from file, API, or other source - Optionally unwrap payloads (e.g. results) - Handle pagination - No schema assumptions - No identity logic - No mutation beyond structural access

Config controls: - Source type (file / HTTP) - Pagination strategy - Wrapper path (results, etc.) - Rate limits, retries

Key invariant: Extraction never interprets meaning. This stage should be boring enough you forget it exists.

2. Normalize

Generic, contract-driven

What it does: - Apply field transforms (x-transform) - Rename fields - Convert casing - Apply defaults - Derive foreign IDs (swapiId) - Drop unknown fields - Enforce schema shape

Inputs: - Raw extracted records - OpenAPI schema + extensions

Outputs: - Structurally correct, canonical records - Still referencing foreign identity

Key invariant: Each record is correct in isolation. No global knowledge allowed here.

3. Identity Resolution

Intentionally custom

What it does: - Scan normalized records across resources - Build (resource, swapiId) → canonicalId map - Rewrite references - Promote identity - Enforce one-to-one mapping

Why it must be custom: - Identity semantics differ per system - Cross-resource context is unavoidable - This is business truth, not schema truth

Key invariant: Canonical identity is assigned once and never recomputed casually. This stage is allowed to be opinionated.

4. Load (Reconcile)

Generic with light config

What it does: - Compare canonical records to DB state - Insert missing - Update changed - Optionally delete removed (source-of-truth mode) - Respect immutability rules (readOnly, IDs)

Config controls: - Match keys - Delete strategy - Partial vs full reconciliation - Batch sizing

Key invariant: Persistence reflects canonical truth, not upstream quirks.

Stage Classification Table

Stage	Generic?	Why
Extract	✅	Pure I/O
Normalize	✅	Schema-driven
Identity Resolution	❌	Domain-specific
Load	✅ (mostly)	State mechanics

This is the correct fault line. Trying to make stage 3 generic is how ETL frameworks become unusable.

Naming Suggestion

Rename your stages internally to avoid confusion: - Extract (not Load) - Normalize - ResolveIdentity - Persist (or Reconcile)

This avoids the “load vs load” collision you noticed and makes logs, configs, and metrics readable.

Why This Will Scale Instead of Collapse

Because: - 80% of the system is declarative - The custom 20% is isolated - You can swap sources without rewriting logic - You can replay safely - You can test each phase independently

That’s not accidental — that’s good architecture asserting itself.

Final Verdict

You didn’t just “get it right”.

You found the natural decomposition of a governed ETL system:

Generic where physics allows

Custom where truth demands

Explicit boundaries everywhere

That’s the kind of design that stays understandable after six months away, which is the real test.

If you want, next we can: - Sketch the config schema for stage 1 & 4 - Formalize contract extensions for stage 2 - Define invariants/tests per stage

But structurally? You’re done arguing with the shape of the problem.