Operations

ML Model Operations (MLOps)

Advanced

A machine-learning model is software that also depends on data and gives probabilistic answers. So it needs everything normal software needs (versioning, testing, CI/CD, monitoring) plus its own concerns: data and model versioning, evaluation, drift detection, and rollback. MLOps is the engineering of getting models into production and keeping them healthy, and it is the operational counterpart to the governance covered in High-Risk AI.

High-Risk AI covers the legal and oversight duties for automated decisions. MLOps covers the engineering: how we train, version, deploy, monitor, and retire models reliably. The key difference from ordinary code is that a model's behaviour depends on data, so it can get worse over time even if the code never changes (drift). Monitoring and re-evaluation are therefore not optional.

For our risk-scoring and AML use cases, MLOps connects tightly to fail-closed behaviour, human oversight, and evidence. A model's output must be validated, explainable, and never the only gate on a regulated decision (see High-Risk AI, AML Screening, Designing for Failure).

Version and reproduce

Always: Version the model, its training data, and the code and config together, so any prediction can be traced to exactly what produced it. This is essential for audit and rollback (see Audit Trails).
Do: Make training reproducible (pinned data snapshots, seeds, environment) so a model can be rebuilt and a result explained.
Do: Treat model artifacts and pipelines as code, kept in source control, built and deployed through CI/CD, and never hand-deployed (see CI/CD & Deployment).
Do: Govern training data for quality, representativeness, residency, and privacy. It is personal and regulated data subject to all our rules (see Data Protection & Privacy, Cross-Border Data Transfers).
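
One way to make "version the model, data, code, and config together" concrete is a single lineage record whose identifier changes whenever any of the four changes. The following is a minimal sketch, not a description of our actual tooling: `ModelLineage`, `build_lineage`, and all field names are hypothetical, and real artifacts would normally be hashed from files rather than in-memory bytes.

```python
import hashlib
import json
from dataclasses import dataclass, asdict


def sha256_bytes(data: bytes) -> str:
    """Return the hex SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


@dataclass(frozen=True)
class ModelLineage:
    """Hypothetical record tying a model to everything that produced it."""
    model_sha256: str   # digest of the serialized model artifact
    data_sha256: str    # digest of the pinned training-data snapshot
    code_commit: str    # source-control revision of the training code
    config_sha256: str  # digest of the canonicalized training config
    seed: int           # random seed used for the training run

    def version_id(self) -> str:
        """Derive one stable identifier from all fields together."""
        canonical = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return sha256_bytes(canonical)[:16]


def build_lineage(model_bytes: bytes, data_bytes: bytes, code_commit: str,
                  config: dict, seed: int) -> ModelLineage:
    """Hash each input; the config is serialized with sorted keys so that
    key order does not change the digest."""
    config_bytes = json.dumps(config, sort_keys=True).encode("utf-8")
    return ModelLineage(
        model_sha256=sha256_bytes(model_bytes),
        data_sha256=sha256_bytes(data_bytes),
        code_commit=code_commit,
        config_sha256=sha256_bytes(config_bytes),
        seed=seed,
    )
```

Storing the `version_id()` alongside every prediction is what lets an auditor go from a single decision back to the exact model, data snapshot, code revision, and config that produced it.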

Evaluate, deploy, monitor

Do: Evaluate before deploying (accuracy, robustness, and bias/fairness on held-out data) and record the results as evidence (see High-Risk AI).
Do: Roll models out gradually (shadow or canary) and keep a fast rollback to the previous version (see Deployment Strategies).
Always: Monitor models in production for drift, degradation, and unexpected outputs, with alerting. A model can silently get worse as the world changes (see Security Monitoring & Detection, SLOs).
Do: Re-evaluate and retrain on a defined schedule and on drift signals. Treat a degraded model as an incident path (see Incident Readiness).
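
The drift monitoring described above can be sketched with the Population Stability Index (PSI), one common way to compare the distribution of a feature or score in production against its training baseline. This is an illustrative choice rather than a prescribed method. The function names are hypothetical, and the 0.1 and 0.25 thresholds are widely used rules of thumb that should be treated as assumptions to tune per model.

```python
import math
from typing import List, Sequence


def bin_fractions(values: Sequence[float], edges: Sequence[float],
                  eps: float = 1e-6) -> List[float]:
    """Fraction of values in each bin defined by sorted interior edges.

    Fractions are floored at eps so that empty bins do not cause
    division by zero or log(0) in the PSI formula.
    """
    if not values:
        raise ValueError("cannot compute bin fractions of an empty sample")
    counts = [0] * (len(edges) + 1)
    for v in values:
        # Bin index = number of edges the value has reached or passed.
        counts[sum(1 for e in edges if v >= e)] += 1
    return [max(c / len(values), eps) for c in counts]


def population_stability_index(baseline: Sequence[float],
                               live: Sequence[float],
                               edges: Sequence[float]) -> float:
    """PSI = sum over bins of (actual - expected) * ln(actual / expected)."""
    expected = bin_fractions(baseline, edges)
    actual = bin_fractions(live, edges)
    return sum((a - e) * math.log(a / e) for a, e in zip(actual, expected))


def drift_status(psi: float, warn: float = 0.1, alert: float = 0.25) -> str:
    """Map a PSI value to a status; thresholds are assumed rules of thumb."""
    if psi >= alert:
        return "alert"
    if psi >= warn:
        return "warn"
    return "ok"
```

In this sketch an "alert" status would feed the alerting and incident path, and a "warn" would prompt an earlier re-evaluation than the regular schedule. PSI only detects shifts in inputs or scores, so it complements, rather than replaces, re-evaluating accuracy and fairness on labelled data.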

Decisions, safety & oversight

Do: Make outputs explainable and log the inputs, model version, and result for each decision, so it can be reconstructed and contested (see High-Risk AI, Auditability & Evidence).
Always: Fail closed on low-confidence, missing, or errored model output: block and escalate to a human, never auto-approve (see Designing for Failure, AML Screening).
Do: Keep meaningful human oversight and an override path for decisions that materially affect people. Treat model output as advice, not authority (see AI & LLM Feature Security).
Never: Let a model be the only, unchecked gate on a regulated decision, or deploy one without the evaluation, monitoring, and oversight controls in place.
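
The fail-closed and logging rules above can be sketched as a thin wrapper around the model call. This is a minimal illustration under stated assumptions: `decide`, `Decision`, `audit_record`, the `(score, confidence)` return shape of `predict`, and both thresholds are hypothetical. The outcome "continue" only means the case moves on to the remaining controls, so the model is never the sole gate.

```python
import json
import math
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
from typing import Callable, Optional


@dataclass(frozen=True)
class Decision:
    outcome: str                  # "continue" or "human_review"
    reason: str                   # why this outcome was chosen
    model_version: str            # which model produced the output
    score: Optional[float]        # risk score, if one was produced
    confidence: Optional[float]   # model confidence, if one was produced


def decide(features: dict, predict: Callable[[dict], tuple],
           model_version: str, min_confidence: float = 0.9,
           max_score: float = 0.2) -> Decision:
    """Route to human review unless output is valid, confident, and low risk."""
    def review(reason, score=None, confidence=None):
        return Decision("human_review", reason, model_version, score, confidence)

    try:
        score, confidence = predict(features)
        score, confidence = float(score), float(confidence)
    except Exception as exc:
        # Errors, None results, and malformed output all fail closed.
        return review(f"model_error:{type(exc).__name__}")
    if math.isnan(score) or math.isnan(confidence):
        return review("invalid_output")
    if confidence < min_confidence:
        return review("low_confidence", score, confidence)
    if score > max_score:
        return review("elevated_risk", score, confidence)
    return Decision("continue", "low_risk_high_confidence",
                    model_version, score, confidence)


def audit_record(features: dict, decision: Decision) -> str:
    """Serialize inputs, model version, and result as one JSON log line."""
    return json.dumps({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "inputs": features,
        **asdict(decision),
    }, sort_keys=True)
```

The design choice here is that every path other than a valid, confident, low-risk output ends in "human_review", so a new failure mode defaults to escalation rather than approval. Each call to `audit_record` captures what is needed to reconstruct and contest the decision later.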

Self-review checklist

Ask: Can I reproduce and trace any prediction to a specific model, data, and code version?
Ask: Was the model evaluated for accuracy and bias, and is it monitored for drift in production?
Ask: On low-confidence or errored output, does it fail closed to human review?
Ask: Is there human oversight, explainability, and an audit record for decisions that affect people?

Why it matters: Models fail differently from ordinary code. They get worse silently as data shifts, can encode bias, and make decisions about real people. So they need versioning, evaluation, monitoring, and human oversight as first-class engineering concerns. For our regulated risk and AML use cases, disciplined MLOps is what makes automated intelligence safe, explainable, and defensible rather than an unaccountable black box.