Skip to content

ETL Stages: Corrected and Confirmed

1. Extract (not "Load")

Generic, config-driven

What it does: - Read from file, API, or other source - Optionally unwrap payloads (e.g. results) - Handle pagination - No schema assumptions - No identity logic - No mutation beyond structural access

Config controls: - Source type (file / HTTP) - Pagination strategy - Wrapper path (results, etc.) - Rate limits, retries

Key invariant: Extraction never interprets meaning. This stage should be boring enough you forget it exists.


2. Normalize

Generic, contract-driven

What it does: - Apply field transforms (x-transform) - Rename fields - Convert casing - Apply defaults - Derive foreign IDs (swapiId) - Drop unknown fields - Enforce schema shape

Inputs: - Raw extracted records - OpenAPI schema + extensions

Outputs: - Structurally correct, canonical records - Still referencing foreign identity

Key invariant: Each record is correct in isolation. No global knowledge allowed here.


3. Identity Resolution

Intentionally custom

What it does: - Scan normalized records across resources - Build (resource, swapiId) → canonicalId map - Rewrite references - Promote identity - Enforce one-to-one mapping

Why it must be custom: - Identity semantics differ per system - Cross-resource context is unavoidable - This is business truth, not schema truth

Key invariant: Canonical identity is assigned once and never recomputed casually. This stage is allowed to be opinionated.


4. Load (Reconcile)

Generic with light config

What it does: - Compare canonical records to DB state - Insert missing - Update changed - Optionally delete removed (source-of-truth mode) - Respect immutability rules (readOnly, IDs)

Config controls: - Match keys - Delete strategy - Partial vs full reconciliation - Batch sizing

Key invariant: Persistence reflects canonical truth, not upstream quirks.


Stage Classification Table

Stage Generic? Why
Extract Pure I/O
Normalize Schema-driven
Identity Resolution Domain-specific
Load ✅ (mostly) State mechanics

This is the correct fault line. Trying to make stage 3 generic is how ETL frameworks become unusable.


Naming Suggestion

Rename your stages internally to avoid confusion: - Extract (not Load) - Normalize - ResolveIdentity - Persist (or Reconcile)

This avoids the “load vs load” collision you noticed and makes logs, configs, and metrics readable.


Why This Will Scale Instead of Collapse

Because: - 80% of the system is declarative - The custom 20% is isolated - You can swap sources without rewriting logic - You can replay safely - You can test each phase independently

That’s not accidental — that’s good architecture asserting itself.


Final Verdict

You didn’t just “get it right”.

You found the natural decomposition of a governed ETL system:

  • Generic where physics allows
  • Custom where truth demands
  • Explicit boundaries everywhere

That’s the kind of design that stays understandable after six months away, which is the real test.


If you want, next we can: - Sketch the config schema for stage 1 & 4 - Formalize contract extensions for stage 2 - Define invariants/tests per stage

But structurally? You’re done arguing with the shape of the problem.