The EU AI Act, which entered into force on 1 August 2024, places explicit obligations on providers of high-risk AI systems to document training data sources, labelling methodologies, and quality control processes. This is not an abstract compliance checkbox: auditors can be expected to ask for annotation guidelines, inter-annotator agreement scores, and evidence of bias testing across protected characteristics. Organisations that have treated data provenance as an afterthought are discovering that retrofitting documentation onto existing datasets is far more expensive than building it in from the start.
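As an illustration of what an inter-annotator agreement score looks like in practice, the sketch below computes Cohen's kappa for two annotators who labelled the same items. Kappa is one common agreement statistic, not one the Act prescribes, and the function name `cohen_kappa` and the example labels are assumptions made for this sketch.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance from each annotator's
    label frequencies.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item set")

    n = len(labels_a)
    # Fraction of items on which the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: sum over labels of P(A picks label) * P(B picks label).
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    all_labels = counts_a.keys() | counts_b.keys()
    expected = sum(counts_a[k] * counts_b[k] for k in all_labels) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical label; kappa is undefined,
        # so this sketch reports full agreement by convention.
        return 1.0
    return (observed - expected) / (1 - expected)
```

Storing a score like this alongside the annotation guideline version it was measured under gives an auditor a concrete artefact to inspect, rather than a statement that quality control took place.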
NIST's voluntary AI Risk Management Framework (AI RMF 1.0) takes a complementary approach, framing data governance as a risk management discipline rather than a compliance exercise. The GOVERN and MAP functions of the framework ask organisations to identify data-related risks, such as label bias, coverage gaps, and representational harm, before a model enters production, not after. In practice, this means embedding data quality gates into your MLOps pipeline: automated bias checks, distribution coverage reports, and lineage tracking that can answer the question "where did this label come from?" for any sample in your corpus.
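A data quality gate of this kind can be sketched in a few functions. The example below is a minimal illustration, not an implementation of anything the AI RMF specifies: the `LabeledSample` fields, the function names, and the thresholds passed to `run_quality_gate` are all hypothetical, and the bias check is a deliberately simple one (the gap in label rate between groups).

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class LabeledSample:
    # Hypothetical schema: lineage fields travel with every label.
    sample_id: str
    label: str
    group: str               # slice used for coverage and bias checks
    source: str              # where the raw item came from
    annotator_id: str
    guideline_version: str


def coverage_gaps(samples, required_groups, min_count):
    """Return required groups with fewer than min_count samples."""
    counts = Counter(s.group for s in samples)
    return {g: counts[g] for g in required_groups if counts[g] < min_count}


def label_rate_gap(samples, label):
    """Largest difference between groups in the share of samples given `label`."""
    totals, hits = Counter(), Counter()
    for s in samples:
        totals[s.group] += 1
        if s.label == label:
            hits[s.group] += 1
    rates = [hits[g] / totals[g] for g in totals]
    return max(rates) - min(rates) if rates else 0.0


def lineage(samples, sample_id):
    """Answer 'where did this label come from?' for one sample."""
    for s in samples:
        if s.sample_id == sample_id:
            return {
                "source": s.source,
                "annotator_id": s.annotator_id,
                "guideline_version": s.guideline_version,
            }
    raise KeyError(sample_id)


def run_quality_gate(samples, required_groups, min_count, label, max_gap):
    """Return a list of failure messages; an empty list means the gate passes."""
    failures = []
    gaps = coverage_gaps(samples, required_groups, min_count)
    if gaps:
        failures.append(f"coverage below {min_count} for groups: {sorted(gaps)}")
    gap = label_rate_gap(samples, label)
    if gap > max_gap:
        failures.append(f"label rate gap {gap:.2f} for '{label}' exceeds {max_gap:.2f}")
    return failures
```

Returning failure messages rather than raising on the first problem lets a pipeline step log every issue in one report before deciding whether to block a training run.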
For teams working in sensitive domains such as healthcare, finance, and legal services, the regulatory landscape adds a further layer: data minimisation, access controls on PII-adjacent annotation data, and, in some jurisdictions, the right of individuals to request removal of their data from training sets. These requirements interact with ML workflows in non-trivial ways (a removal request, for example, has to be traced to every derived dataset that contains the individual's records) and are reshaping how annotation service providers structure their data handling agreements. Building governance infrastructure now, ahead of enforcement timelines, is likely to be the lowest-cost path through this transition.
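To make the removal-request case concrete, here is a minimal sketch of filtering a training set by data subject and producing an audit entry. It assumes every record already carries a `subject_id`, which is exactly the kind of lineage that is hard to retrofit; the `TrainingRecord` schema and `apply_removal_requests` function are hypothetical, and the sketch says nothing about what any particular jurisdiction requires.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class TrainingRecord:
    # Hypothetical schema: each record is linked to the individual it concerns.
    sample_id: str
    subject_id: str
    text: str


def apply_removal_requests(records, subject_ids):
    """Drop all records for the given subjects and return (kept, audit_entry)."""
    to_remove = set(subject_ids)
    kept = [r for r in records if r.subject_id not in to_remove]
    removed = [r for r in records if r.subject_id in to_remove]

    audit_entry = {
        "processed_at": datetime.now(timezone.utc).isoformat(),
        "removed_sample_ids": [r.sample_id for r in removed],
        # Hashes avoid storing raw identifiers in the log. An unsalted hash
        # is still pseudonymous rather than anonymous, so treat the log as
        # sensitive as well.
        "subject_hashes": sorted(
            hashlib.sha256(s.encode("utf-8")).hexdigest() for s in to_remove
        ),
    }
    return kept, audit_entry
```

Filtering the dataset does not undo the influence of removed records on models already trained on them; whether retraining is needed is a separate question that this sketch does not address.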