Data Pipelines & ETL
Moving and transforming data in bulk (into a warehouse, between systems, for reporting, or to feed a model) is its own discipline. Pipelines must be idempotent and re-runnable, validate data quality, handle partial failure, preserve tenancy and privacy, and be observable. A silent pipeline failure corrupts everything downstream that trusts its output.
ETL and ELT pipelines extract data, transform it, and load it somewhere. They often run on a schedule, in large volumes, and feed decisions or reports. They are dangerous because they run unattended, so silent failures rot downstream data; they touch a lot of often-personal data at once; and they are easy to make non-idempotent, so a re-run loads the data twice. Treat a pipeline as a reliable, observable, re-runnable system, not a script.
This topic connects to Background Jobs (how a pipeline runs), Data Integrity and Distributed Systems (idempotency, consistency), Privacy and Residency (pipelines move regulated data), and Observability.
Make pipelines reliable
- Always: Make pipelines idempotent and re-runnable. A re-run (after a failure or retry) must not load the data twice or corrupt it. Use upserts, keys, or watermarks, not blind appends (see Data Integrity & Transactions).
- Do: Process in bounded batches with checkpoints, so a large run can resume instead of restarting, and does not exhaust memory or overload sources (see Performance & Resource Use).
- Do: Handle partial failure on purpose. Isolate and dead-letter bad records instead of failing the whole run or silently dropping them (see Asynchronous Messaging).
- Do: Run pipelines through durable scheduling and orchestration, not fire-and-forget scripts (see Background Jobs & Scheduled Work).
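The reliability rules above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: it assumes a stable, ordered in-memory source and uses SQLite from the Python standard library as a stand-in destination. The table names (`orders`, `dead_letter`, `checkpoint`), the function `run_load`, and the record fields are all hypothetical.

```python
import sqlite3

# Hypothetical destination schema: a keyed target table, a dead-letter
# table for rejected records, and a checkpoint table for resuming.
SCHEMA = """
CREATE TABLE IF NOT EXISTS orders (
    order_id     TEXT PRIMARY KEY,
    amount_cents INTEGER NOT NULL
);
CREATE TABLE IF NOT EXISTS dead_letter (
    raw    TEXT NOT NULL,
    reason TEXT NOT NULL,
    UNIQUE (raw, reason)
);
CREATE TABLE IF NOT EXISTS checkpoint (
    pipeline    TEXT PRIMARY KEY,
    next_offset INTEGER NOT NULL
);
"""


def run_load(conn, source, batch_size=500, pipeline="orders_load"):
    """Load `source` (a list of dicts) in bounded batches, resuming from
    the last committed checkpoint. Returns the final offset."""
    conn.executescript(SCHEMA)
    row = conn.execute(
        "SELECT next_offset FROM checkpoint WHERE pipeline = ?", (pipeline,)
    ).fetchone()
    offset = row[0] if row else 0

    while offset < len(source):
        batch = source[offset:offset + batch_size]
        # One transaction per batch: the rows and the checkpoint commit
        # together, or neither does.
        with conn:
            for record in batch:
                try:
                    order_id = str(record["order_id"])
                    amount = int(record["amount_cents"])
                except (KeyError, TypeError, ValueError) as exc:
                    # Isolate the bad record instead of failing the run
                    # or dropping it silently.
                    conn.execute(
                        "INSERT OR IGNORE INTO dead_letter (raw, reason) "
                        "VALUES (?, ?)",
                        (repr(record), type(exc).__name__),
                    )
                    continue
                # Keyed write, not a blind append: replaying the same
                # record replaces the existing row for that key.
                conn.execute(
                    "INSERT OR REPLACE INTO orders (order_id, amount_cents) "
                    "VALUES (?, ?)",
                    (order_id, amount),
                )
            offset += len(batch)
            conn.execute(
                "INSERT OR REPLACE INTO checkpoint (pipeline, next_offset) "
                "VALUES (?, ?)",
                (pipeline, offset),
            )
    return offset
```

Because every write is keyed, replaying a batch after a crash leaves the destination unchanged. A real pipeline would more likely scope the checkpoint to a run or partition and read the source through a cursor or watermark, but the shape is the same.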
Validate quality and observe
- Do: Validate data quality at ingestion and after transforms (schema, ranges, row counts, referential checks). Bad input silently becomes bad output that everyone trusts.
- Always: Make pipelines observable. Log runs, record row counts and lineage, and alert on a failure or a run that did not happen. A silent missed or failed pipeline is the dangerous case (see Observability & Logging Hygiene).
- Do: Track data lineage (where it came from, what transformed it) for debugging and for audit and regulatory traceability (see Auditability & Evidence).
- Do: Reconcile against the source where correctness matters (counts, totals), so divergence is caught (see Distributed Systems & Consistency).
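As an illustration of the validation and reconciliation points above, here is a small sketch in plain Python. The field names, the `QualityReport` type, and the functions `validate` and `reconcile` are hypothetical. In practice the source count and total would come from a query against the source system, and the report would be logged and alerted on.

```python
from __future__ import annotations

from dataclasses import dataclass, field

REQUIRED_FIELDS = ("order_id", "customer_id", "amount_cents")


@dataclass
class QualityReport:
    rows_in: int = 0
    rows_valid: int = 0
    errors: list[str] = field(default_factory=list)


def validate(rows: list[dict], known_customers: set[str]):
    """Apply schema, range, and referential checks; return the valid rows
    plus a report that can be logged with the run."""
    report = QualityReport(rows_in=len(rows))
    valid = []
    for i, row in enumerate(rows):
        missing = [name for name in REQUIRED_FIELDS if row.get(name) is None]
        if missing:  # schema check
            report.errors.append(f"row {i}: missing {missing}")
            continue
        amount = row["amount_cents"]
        if not isinstance(amount, int) or amount < 0:  # range check
            report.errors.append(f"row {i}: amount_cents out of range")
            continue
        if row["customer_id"] not in known_customers:  # referential check
            report.errors.append(f"row {i}: unknown customer_id")
            continue
        valid.append(row)
    report.rows_valid = len(valid)
    return valid, report


def reconcile(source_count: int, source_total_cents: int, loaded: list[dict]):
    """Compare loaded rows against figures taken from the source.
    An empty list means the two sides agree."""
    problems = []
    if len(loaded) != source_count:
        problems.append(f"count: source={source_count} loaded={len(loaded)}")
    loaded_total = sum(row["amount_cents"] for row in loaded)
    if loaded_total != source_total_cents:
        problems.append(
            f"total: source={source_total_cents} loaded={loaded_total}"
        )
    return problems
```

Returning the rejected rows' reasons, instead of raising on the first one, keeps the row counts available for the run log, so a drop in `rows_valid` can trigger an alert even when the run itself succeeds.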
Privacy, tenancy & security
- Always: Preserve tenancy and data classification through the pipeline. Do not mix one tenant's data with another's, and protect the data in transit and at the destination as strongly as at the source (see Multi-Tenancy, Data Classification).
- Do: Minimise and mask. Only move the data you need, and pseudonymise or mask it where the destination does not need identities, as in analytics (see Data Masking & Redaction, Product Analytics Privacy).
- Do: Respect residency. Know where the sources, compute, and destinations are, and do not export regulated data across borders without approval (see Cross-Border Data Transfers).
- Never: Copy production personal or KYC data into a non-production or lower-trust destination through a pipeline without masking and authorisation (see Test Data & Environments).
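One way to apply minimisation and pseudonymisation in a transform step is sketched below. It is an assumption-laden example: the field names, the allow-list, and the function names are hypothetical, and the key is assumed to come from a secrets manager, never from source code. It uses a keyed hash (HMAC-SHA256) from the Python standard library so that the same user maps to the same reference without the identity travelling to the destination.

```python
import hashlib
import hmac

# Hypothetical allow-list: only these fields leave the source system.
ALLOWED_FIELDS = ("tenant_id", "event", "amount_cents")


def pseudonymise(value: str, key: bytes) -> str:
    """Keyed hash of an identifier; stable for the same key and input."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def to_analytics_row(row: dict, key: bytes) -> dict:
    """Build the destination row from an allow-list, carrying tenant_id
    through unchanged and replacing the user identity with a reference."""
    out = {name: row[name] for name in ALLOWED_FIELDS}
    # Scope the reference to the tenant so the same user_id in two
    # tenants never collapses into one analytics identity.
    out["user_ref"] = pseudonymise(f'{row["tenant_id"]}:{row["user_id"]}', key)
    return out
```

An allow-list fails closed: a new personal field added at the source is not exported until someone adds it deliberately. Anyone who holds the key can recompute the references, so the key needs the same protection as the identities it hides.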
Self-review checklist
- Ask: If this pipeline runs twice or resumes after a failure, is the result still correct (idempotent)?
- Ask: Is data quality validated, and would a failed or missed run actually alert someone?
- Ask: Is tenancy and classification preserved, data minimised and masked, and residency respected?
- Ask: Can I trace lineage and reconcile against the source?