Key risk areas: Supply chain, poisoning and drift
To tackle that challenge, the CSI pinpoints three critical risk areas where data is most exposed: the data supply chain, maliciously modified ("poisoned") training data, and undetected drift over time.
Data supply chain
Large-scale, third-party datasets can contain errors or backdoors introduced unwittingly or maliciously. Unvalidated training data corrupts not only the immediate model but also "any additional models that rely on [it] as a foundation."
To mitigate this, organizations should implement robust verification before ingesting any new data (e.g. checksums or digital signatures) and track data provenance through content credentials or metadata that attest to the source and integrity of each dataset. Data should be certified “free of malicious or inaccurate material” before use, and kept in append-only, signed stores after ingestion.
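To make the verification step concrete, here is a minimal sketch of checking files against a provenance manifest before ingestion. The manifest format (a JSON list of file, SHA-256 digest, and source entries published by the data provider) is an assumption for illustration, not a format the CSI prescribes.

```python
import hashlib
import json
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Compute the SHA-256 digest of a file in streaming fashion."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_against_manifest(dataset_dir: Path, manifest_path: Path) -> bool:
    """Check every file listed in a provenance manifest before ingestion.

    Hypothetical manifest format: a JSON list of
    {"file": ..., "sha256": ..., "source": ...} entries
    shipped by the data provider alongside the dataset.
    """
    manifest = json.loads(manifest_path.read_text())
    for entry in manifest:
        file_path = dataset_dir / entry["file"]
        if not file_path.exists():
            print(f"MISSING: {entry['file']} (source: {entry['source']})")
            return False
        if sha256_of(file_path) != entry["sha256"]:
            print(f"CHECKSUM MISMATCH: {entry['file']}")
            return False
    return True

if __name__ == "__main__":
    ok = verify_against_manifest(Path("data/incoming"), Path("data/manifest.json"))
    print("Verified, safe to ingest" if ok else "Verification failed, quarantine dataset")
```

Checksums detect accidental corruption and naive tampering; pairing the manifest with a digital signature (as sketched later in this section) also authenticates who produced it.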
Maliciously modified (“poisoned”) training data
Adversaries may attempt to inject subtle corruptions or fake records into training pipelines. The CSI calls for continuous vetting of training sets: remove or flag any suspicious or anomalous entries, and cryptographically sign datasets at ingestion to detect tampering.
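As a rough illustration of the "remove or flag" step, the sketch below flags numeric records that fall far outside the dataset's bulk. The z-score threshold and synthetic data are assumptions; real vetting pipelines combine statistical filters like this with domain-specific checks, since not all poisoned records are statistical outliers.

```python
import numpy as np

def flag_outliers(features: np.ndarray, z_threshold: float = 4.0) -> np.ndarray:
    """Return indices of rows whose z-score exceeds the threshold on any feature.

    A crude first-pass filter: extreme rows are flagged for human
    review rather than silently dropped.
    """
    mean = features.mean(axis=0)
    std = features.std(axis=0) + 1e-12  # avoid division by zero
    z_scores = np.abs((features - mean) / std)
    return np.where((z_scores > z_threshold).any(axis=1))[0]

# Example: 1,000 plausible rows plus a few implausible injected records.
rng = np.random.default_rng(0)
data = rng.normal(0.0, 1.0, size=(1000, 5))
data[[3, 250, 999]] += 25.0  # simulated poisoned entries
print("Flagged rows for review:", flag_outliers(data))
```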
Organizations should require their data and model providers to formally certify that their inputs contain no known compromises. Data consumers and curators must maintain end-to-end integrity, from signed collection and secure storage to real-time monitoring of network and user activity for unexpected changes.
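A minimal sketch of the signed-collection idea follows, using Ed25519 signatures from the widely used `cryptography` package. Key management, key distribution, and the dataset contents are all stand-ins here; in practice the producer's key would be managed in an HSM or key service, not generated inline.

```python
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

# --- At collection time: the data producer signs the serialized dataset. ---
private_key = Ed25519PrivateKey.generate()  # stand-in; use a managed key in practice
public_key = private_key.public_key()

dataset_bytes = b"label,feature\n1,0.5\n0,0.2\n"  # stand-in for the serialized dataset
signature = private_key.sign(dataset_bytes)

# --- At training time: the consumer verifies before use. ---
try:
    public_key.verify(signature, dataset_bytes)
    print("Signature valid: dataset unmodified since collection")
except InvalidSignature:
    print("Signature check FAILED: dataset may have been tampered with")
```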
Data drift
Over time, the statistical properties of input data can change (“drift”), reducing model accuracy. This degradation is natural but must be distinguished from attacks. The CSI notes that gradual shifts typically indicate normal drift, while abrupt changes can signal poisoning.
Organizations should continuously monitor AI inputs and outputs, comparing incoming data distributions to the training baselines. Data management processes (regular retraining with fresh data, cleansing, and ensemble models) help keep models calibrated. In high-stakes environments (e.g. healthcare), even small drifts matter, so "continuous monitoring of model performance with additional analysis of the input data" is important.
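One common way to compare incoming distributions against the training baseline is a per-feature two-sample Kolmogorov-Smirnov test, sketched below. The significance threshold, window sizes, and synthetic data are placeholders to tune per deployment; the size of the KS statistic and how suddenly it appears are what help separate gradual, natural drift from the abrupt shifts the CSI flags as possible poisoning.

```python
import numpy as np
from scipy.stats import ks_2samp

def drift_report(baseline: np.ndarray, incoming: np.ndarray, alpha: float = 0.01) -> None:
    """Compare an incoming feature window against the training baseline.

    A small p-value means the incoming distribution differs from the
    baseline; the KS statistic's magnitude indicates how severely.
    """
    for i in range(baseline.shape[1]):
        stat, p_value = ks_2samp(baseline[:, i], incoming[:, i])
        status = "DRIFT" if p_value < alpha else "ok"
        print(f"feature {i}: KS={stat:.3f} p={p_value:.4f} -> {status}")

rng = np.random.default_rng(1)
baseline = rng.normal(0.0, 1.0, size=(5000, 3))

# Gradual drift: a small mean shift, typical of natural change (small KS statistic).
gradual = rng.normal(0.3, 1.0, size=(1000, 3))
# Abrupt change: a large jump worth investigating as possible poisoning (KS near 1).
abrupt = rng.normal(5.0, 1.0, size=(1000, 3))

print("-- gradual shift --")
drift_report(baseline, gradual)
print("-- abrupt shift --")
drift_report(baseline, abrupt)
```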