The Legality of Personal Data Scraping to Train AI Models

May 29, 2026


Summary

Current ICO position: The ICO regards the scraping of personal data to train generative AI as ‘high-risk’ and ‘invisible’ processing, and AI companies that carry it out must comply with the UK GDPR.

The Only Lawful Basis: Legitimate interests is the only viable lawful basis. The others (such as consent or contract) are unavailable in practice because developers typically have no direct relationship with the individuals whose data is scraped.

The Three-Part Test to qualify for Legitimate interests:

Purpose: Developers must define a specific ‘legitimate interest’ for scraping personal data; vague goals like “advancing technology” are not enough.
Necessity: Developers must show why scraping personal data is necessary to fulfil their legitimate interest, and why less intrusive options (such as licensed, synthetic, or anonymised data) will not work.
Balancing: Developers must weigh their goals against individual rights. This requires strict transparency, data filtering, and active opt-out mechanisms.

Developers are expected to:

justify why alternatives won’t work;
minimise data collection;
filter sensitive data; and
maintain Legitimate Interests Assessments (LIAs).

AI developers must consider various legal issues when developing and improving their AI models. In this blog, we focus on the position of the Information Commissioner’s Office (“ICO”) on training AI models by scraping personal data from the web, from the legal and regulatory perspective of the UK GDPR.

Current ICO position on the legality of scraping personal data to train generative AI models

The ICO held a five-part consultation on this matter in 2024, culminating in a report published at the end of the year. Its position appears to have remained unchanged since the initial consultation.

Whilst the ICO has acknowledged the potential economic and societal benefits of generative AI, it has been sceptical of indiscriminate scraping practices, particularly where individuals are unaware that their personal data is being harvested and repurposed for model training.

Under the UK GDPR, organisations must identify a lawful basis before processing personal data, and, according to the ICO, a data controller’s “legitimate interests” is likely to be the only lawful basis which AI developers can use when processing personal data obtained via web scraping practices.

Why legitimate interests is the only available lawful basis for scraping personal data to train AI models

There are several possible lawful bases, but the ICO considers most of them to be unavailable in the context of generative AI data scraping.

Consent

The central problem with consent is that AI developers almost certainly have no direct relationship with the individuals whose data they collect. Consent as a lawful basis must be ‘specific and informed’. For consent to be “specific”, individuals need to know to whom they are giving it. For it to be informed, individuals must know, before their data is used, what it will be used for. These criteria will be impossible to fulfil when, for example, individuals have no idea that their personal data will be scraped from the internet and used by an AI developer.

Performance of a contract

The “performance of a contract” basis is similarly unavailable because the AI developer generally has no contract with the individual whose data is scraped.

The ICO also considers that it will be uncommon for AI training (as opposed to the mere use of an AI system) to be required in order to perform a contract with an individual.

Legal obligation

Developers cannot rely on compliance with a legal obligation because there is currently no law requiring companies to scrape personal data to train AI systems.

Vital interests

The “vital interests” basis applies primarily where processing is necessary to protect someone’s life. Training commercial generative AI models does not fall within this category.

Public task

Public task is generally limited to organisations exercising official authority or carrying out functions in the public interest. The ICO therefore considers it highly unlikely to apply to commercial AI developers.

Legitimate interests and the three-part test

This leaves legitimate interests as the only realistic lawful basis.

Legitimate Interests is regarded as the most flexible lawful basis for processing personal data. However, relying on legitimate interests brings with it the UK GDPR requirement that the data controller can satisfy a three-part test.


1. Identify a legitimate interest (the “Purpose Test”)

The first part requires the organisation to identify a legitimate interest.

Legitimate interests can include commercial objectives, individual interests, or broader societal benefits.

The ICO suggests organisations should ask questions such as:

Why is the personal data being used?
What specific objective is being pursued?
Who benefits from the processing?
Are there wider public benefits?
How important are those benefits?
What would happen if the processing could not proceed?
Would the proposed use be unethical or unlawful?

2. Show that the use of personal information is necessary to achieve the legitimate interest (the “Necessity Test”)

The second part asks whether the processing is necessary to achieve the identified interest.

Most importantly, if the organisation could reasonably achieve the same result through a less intrusive method, legitimate interests will not apply.

The ICO recommends considering:

Whether the processing genuinely furthers the stated interest;
Whether the approach is reasonable; and
Whether a less intrusive alternative exists.

3. Balance it against the interests, rights and freedoms of the person whose information you want to use (the “Balancing Test”)

The final part requires organisations to balance their legitimate interest in the processing against the interests, rights and freedoms of the individuals whose information they want to use.

The ICO states that individuals’ interests are more likely to override the controller’s interests where:

They would not reasonably expect you to use the information in that way; or
The proposed use of the information would cause unjustified harm.

Organisations must therefore consider people’s reasonable expectations, but they should also consider the following:

The nature of their relationship with individuals;
Whether any particularly sensitive or private information is involved;
How comfortable they are explaining the use of the information to the affected individual in question;
The intrusiveness of the processing, and/or the likelihood of the affected individuals objecting to the practice;
The potential impact on the affected individuals (including personal, professional, commercial, financial etc.);
Whether children’s information is being used;
Whether vulnerable individuals are affected, or if the affected individuals are at increased risk of harm in any way;
Whether there are any safeguards available; and
Whether less personal information can be collected to achieve the legitimate interest, or if certain individuals can opt out.

Transparency

The ICO also emphasises transparency: a lack of transparency can lead the regulator to conclude that the Balancing Test has not been passed.

Organisations therefore need to:

Keep on file a Legitimate Interests Assessment (“LIA”), in which the data controller details why it satisfies each part of the three-part test, and review it as processing practices evolve; and
Include the details of their legitimate interests in their publicly available privacy information.
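
For teams that keep their LIA alongside their engineering records, the following is a minimal Python sketch of how such a record might be structured. The class name, field names, and the 180-day review interval are all illustrative assumptions; the ICO does not prescribe a format or a review period, and a record like this is not a substitute for the written assessment itself.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class LegitimateInterestsAssessment:
    """Hypothetical record with one written answer per part of the three-part test."""
    purpose: str      # the specific legitimate interest being pursued
    necessity: str    # why less intrusive alternatives will not work
    balancing: str    # how individuals' interests, rights and freedoms were weighed
    last_reviewed: date
    review_interval_days: int = 180  # assumed internal policy, not an ICO figure

    def is_complete(self) -> bool:
        # Every part of the test needs a non-empty written answer.
        return all(s.strip() for s in (self.purpose, self.necessity, self.balancing))

    def review_due(self, today: date) -> bool:
        # Flag the assessment for review once the assumed interval has passed.
        return today >= self.last_reviewed + timedelta(days=self.review_interval_days)
```

A check such as `is_complete()` only confirms that each part has been addressed; whether the reasoning is adequate remains a legal judgment.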

Applying the three-part test to personal data scraping to train AI models

The ICO’s 2024 report provides a useful indication of how it expects the Legitimate Interests test to operate in practice for generative AI developers.

Purpose Test in practice

The ICO accepts that generative AI development can involve Legitimate Interests. However, it has drawn a clear distinction between specific and vague objectives.

Developers cannot simply rely on broad statements such as bettering society through innovation, or advancing AI technology. Instead, they must identify precisely what the model is intended to achieve and why the categories of personal data being collected are relevant to that objective.

The regulator’s approach reflects a broader principle of data minimisation: organisations should only collect personal data that is genuinely relevant to a clearly articulated purpose.

Necessity Test in practice

From the ICO’s perspective, the Necessity Test is likely to become increasingly difficult to meet.

The ICO specifically highlighted alternative methods such as licensed datasets from publishers who collect personal data transparently. This means developers may now need to show:

Why licensed datasets are insufficient;
Why synthetic or anonymised alternatives cannot achieve the same result;
Why narrower or curated datasets are inadequate; and
Why scraping particular categories of personal data is necessary.

This does not necessarily prohibit scraping, but it raises the evidential burden substantially.

The larger and more indiscriminate the scraping exercise, the harder it may become to demonstrate necessity.

Balancing Test in practice

The balancing test may ultimately prove the greatest obstacle for developers.

The ICO describes generative AI scraping as “high-risk” and “invisible” processing. Individuals are often unaware their information has been collected, may have no realistic opportunity to object, and may struggle to exercise rights such as erasure once data has been incorporated into model training.

Transparency therefore becomes critical. According to the ICO, insufficient transparency measures may make it difficult for developers to pass the balancing test at all.

Some consultation respondents argued that downstream licensing agreements and terms of use for open-access models could mitigate risks by restricting harmful deployment. The ICO acknowledged this possibility but imposed an important qualification: developers relying on contractual safeguards must demonstrate that those arrangements contain meaningful data protection requirements and that compliance is effectively monitored in practice.

In other words, safeguards cannot merely exist on paper. Developers must be able to show they operate effectively in reality.

The ICO also expects organisations to consider additional mitigations, such as:

Collecting less personal data;
Filtering sensitive information;
Providing opt-out mechanisms;
Improving transparency notices; and
Conducting ongoing reviews of risk and proportionality.
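
As a rough illustration of what the first three mitigations could look like in a scraping pipeline, here is a minimal Python sketch. The opt-out list, the keyword list, and the function name `prepare_record` are hypothetical, and simple pattern matching of this kind would not on its own be enough to detect personal or special category data reliably; it is a sketch under those assumptions, not a description of what the ICO requires.

```python
import re
from urllib.parse import urlparse

# Hypothetical opt-out list: domains whose owners have asked not to be scraped.
OPTED_OUT_DOMAINS = {"example.org"}

# Illustrative email pattern; real personal-data detection needs far more than this.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Hypothetical keywords standing in for a proper sensitive-data classifier.
SENSITIVE_TERMS = ("diagnosis", "religion", "trade union")


def prepare_record(url, text):
    """Return cleaned text for training, or None if the record should be dropped."""
    host = urlparse(url).hostname or ""
    # Opt-out: drop anything from an opted-out domain or one of its subdomains.
    if host in OPTED_OUT_DOMAINS or any(
        host.endswith("." + domain) for domain in OPTED_OUT_DOMAINS
    ):
        return None
    # Filtering: drop records that mention any of the sensitive keywords.
    lowered = text.lower()
    if any(term in lowered for term in SENSITIVE_TERMS):
        return None
    # Minimisation: replace email addresses with a placeholder before storage.
    return EMAIL_RE.sub("[EMAIL]", text)
```

For example, `prepare_record("https://example.com/a", "Contact jane.doe@example.com today")` returns `"Contact [EMAIL] today"`, while any page from `example.org` is dropped entirely.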

Conclusion

Legitimate interests remains the only realistic lawful basis for using web-scraped personal data to train generative AI models, but meeting the associated obligations under the UK GDPR can be complex.

Developers are expected to justify why alternative data collection methods are unsuitable and to implement meaningful safeguards that genuinely protect individuals’ rights.

Ultimately, the ICO’s position is that generative AI innovation does not sit outside existing data protection law.

Developers may still be able to rely on Legitimate Interests, but only where they can demonstrate that their processing is proportionate, transparent, and fair. Given that regulation may well become stricter, it is imperative that AI developers rigorously document, file, and update their LIAs to remain compliant.

EM Law are experts in AI and data protection. If you need help navigating AI or data protection, please contact us or visit our AI Lawyers, Software & Tech Lawyers or Data Protection Lawyers pages for more information.
