AlignTrust
AI Threats & Attacks

Data Poisoning

An attack that corrupts an AI model's training data to manipulate its behaviour — causing misclassifications, backdoors, or degraded performance.

What Is Data Poisoning?

Data poisoning is an adversarial attack against machine learning systems in which an attacker manipulates the training data to corrupt a model's learned behaviour. By injecting malicious, mislabelled, or carefully crafted data into the training pipeline, an attacker can cause the model to behave incorrectly — misclassifying inputs, producing wrong outputs, or containing a "backdoor" that triggers specific malicious behaviour when activated.

Data poisoning is an integrity attack on the AI pipeline — rather than attacking the deployed model, the attack occurs earlier in the development process.

Types of Data Poisoning Attacks

Availability attacks: Degrade overall model performance by introducing large amounts of corrupted data. Goal: make the model useless.

Targeted attacks: Cause the model to misclassify specific inputs. Example: a spam filter trained on poisoned data classifies specific malicious emails as safe.

Backdoor attacks (trojan attacks): Insert a hidden trigger into the model during training. When the trigger pattern appears in inputs at deployment, the model behaves maliciously — but operates normally on clean inputs. The backdoor is nearly invisible without dedicated testing.

Where Data Poisoning Occurs

  • Training data collection: Web scraping, crowdsourced labelling, or open datasets can be targeted
  • Fine-tuning pipelines: Organisations fine-tuning pre-trained models with custom data
  • Federated learning: Distributed learning systems where participant nodes can contribute poisoned updates
  • RAG knowledge bases: Documents injected into retrieval systems to manipulate AI responses

Real-World Relevance

As organisations use AI for fraud detection, access control decisions, malware classification, and content moderation, the consequences of data poisoning extend beyond model accuracy to direct security outcomes.

Defences Against Data Poisoning

  • Training data provenance and integrity: Verify data sources and maintain checksums
  • Data sanitisation: Filter outliers and anomalous labels before training
  • Differential privacy: Limit the influence any single training sample can have
  • Adversarial training: Include adversarial examples in training to build robustness
  • Model monitoring: Detect behavioural drift in deployed models that might indicate poisoning
  • AI red teaming: Test models for backdoors and unexpected behaviours before deployment