Data Scraping – Navigating the Challenges Old and New (Part 1: Data Protection) - EM Law

October 4, 2023

Colin Lambertus and Neil Williamson were featured in OneTrust’s DataGuidance, writing an article on ‘data scraping’ (also known as web scraping). That article was a condensed version of a larger, two-part series that addresses the two key legal aspects of web scraping: data protection and intellectual property. Part 1 considers the data protection position.

Web scraping has become a focus for our clients in recent months. Whether in relation to the latest AI developments, price comparison websites or data republishers, businesses are ever more reliant on the leviathan of publicly available data on the internet. It is common knowledge that a great deal of this data is personal data. Personal data often has the most value to businesses: modern social media companies and search engines practically owe their existence to the re-use of personal data published by individual third parties. For this reason, the use of personal data is heavily regulated by the well-known GDPR. We explore the key concerns that data controllers (and, to an extent, processors) need to have in mind if they wish to utilise web scraping tools.

Data Protection

The GDPR does not explicitly address the legality of web scraping.

Instead, it applies to web scraping in the same way as it applies to personal data collected by a data controller or processor through any other means.

The familiar GDPR rules therefore apply: the organization must have a lawful basis to process the personal data collected through web scraping, establish the required technical and organizational safeguards for securing and managing that personal data, and adhere to the data protection principles outlined in Article 5 of the GDPR.

However, does this imply that web scrapers can feel at ease? Not entirely. There are specific data protection challenges and, indeed, heightened risks that come into focus when an organization engages in web scraping.

Notification

Most will be familiar with Article 13 of the GDPR, which mandates the controller to inform data subjects of specific aspects of the processing of their personal data. This is the standard suite of information contained in most privacy policies linked to in website footers.

However, a crucial point, particularly relevant to web scraping, is that Article 13 of the GDPR applies when personal data has been collected directly from the data subject.

Where personal data is collected indirectly, as is the case with web scraping, Article 14 of the GDPR comes into play. Article 14 stipulates that the information outlined in Article 13 of the GDPR must be provided to the data subject at a later point. There are three scenarios:

within a reasonable period after obtaining the personal data, but not later than one month;
if the personal data is intended for communication with the data subject, at the latest during the first communication with that data subject; or
if disclosure to another recipient is planned, at the latest when the personal data is first disclosed.

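Hence, the absolute deadline is one month from the date the personal data is obtained, and it falls earlier if the organization first contacts the data subject or first discloses the data before then.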
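To illustrate how the three scenarios interact, the following is a minimal Python sketch that works out the latest date by which notice would be due. The function names are hypothetical, and the way ‘one month’ is counted here (same day of the following month, clamped to the month’s last day) is an assumption for illustration rather than a rule taken from the GDPR or any regulator’s guidance.

```python
import calendar
from datetime import date
from typing import Optional


def add_one_month(d: date) -> date:
    """Same day in the following month, clamped to that month's last day.

    Assumption: this is one plausible way of counting 'one month'.
    """
    year = d.year + (1 if d.month == 12 else 0)
    month = 1 if d.month == 12 else d.month + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(d.day, last_day))


def latest_notice_date(
    obtained: date,
    first_communication: Optional[date] = None,
    first_disclosure: Optional[date] = None,
) -> date:
    """Return the earliest of the outer limits that apply to this data."""
    # The one-month limit always applies.
    candidates = [add_one_month(obtained)]
    # Notice is due no later than the first communication with the data subject.
    if first_communication is not None:
        candidates.append(first_communication)
    # Notice is due no later than the first disclosure to another recipient.
    if first_disclosure is not None:
        candidates.append(first_disclosure)
    return min(candidates)
```

For example, data scraped on 4 October 2023 and used to email the data subject on 10 October 2023 would, on this sketch, need the notice to be given by 10 October 2023 rather than 4 November 2023.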

However, there are exemptions, as detailed in Article 14(5)(b) of the GDPR: the information does not need to be provided if doing so “proves impossible or would involve a disproportionate effort.”

Organizations engaged in web scraping are therefore expected to comply with Article 14 of the GDPR. This may be achievable, but if web scrapers are processing the personal data of thousands of data subjects (or more), it becomes difficult.

For web scrapers collecting personal data, the most immediate task is to ensure that the necessary information about the organization’s data processing activities is published on its website or in another publicly accessible format.

Notably, in 2019, the Polish data protection authority (PDPA) imposed a significant fine on a Swedish web scraper that had been extracting personal data from official sources, involving approximately 7.5 million data subjects. Among these, around 600,000 data subjects had available email addresses, while 200,000 mobile numbers were being processed, and only a postal address was available for the rest. The Swedish web scraper opted to contact only those data subjects with email addresses, citing the high cost (millions of euros) associated with contacting the others. Therefore, it invoked the ‘disproportionate effort’ exemption.

The PDPA disagreed, asserting that merely placing the necessary information on a company website is not enough to meet the Article 14 GDPR notification obligation. Additionally, reaching out by telephone or post to the remaining data subjects did not constitute a disproportionate effort, even though it would have cost millions of euros to do so. Consequently, a fine of €220,000 was issued. This decision was appealed, and while the fine was recalculated, the PDPA’s decision-making process was upheld.

This decision may not represent the stance of other regulators in the UK/EU, but it provides a useful indicator of the potential regulatory response if a web scraper were brought before a regulatory authority.

If a web scraper wished to rely on the impossibility or disproportionate effort exemption, the ICO has made it clear that the scraper will need to make a documented assessment of its reasoning.

The ICO has provided helpful guidance in this respect:

Impossibility: this exemption is not easily invoked. It will likely only apply if the organization lacks any contact details of the data subject and has “no reasonable means to obtain them.” This point is highly important. An organization cannot rely on the impossibility exemption if it simply takes no action to ascertain whether a data subject’s details could be obtained.
Disproportionate effort: this exemption involves a balancing exercise. It requires weighing the effort required to contact individuals against the potential ‘effect’ that the processing will have on them. Web scraping activities that are non-intrusive and light-touch may therefore make it easier for an organization to justify relying on this exemption.

Invisible processing and DPIAs

The practice of web scraping without notifying data subjects is a form of ‘invisible processing.’

In the UK, following Article 35(4) of the GDPR, the ICO is required to publish a list of processing methods that will require a Data Protection Impact Assessment (DPIA). A DPIA is a rigorous analysis carried out by the controller to assess the potential harm to data subjects, and the ways in which an organization will mitigate that harm to an appropriate level.

Where an organization is relying on the impossibility or disproportionate effort exemption for its web scraping activities and these activities also involve one of the ‘high-risk’ indicators outlined by the Article 29 Working Party (e.g., large-scale processing, monitoring, automated decision-making, etc.), the organization must carry out a DPIA. Even where invisible processing is not combined with a high-risk activity, the ICO’s guidance is that a DPIA should still be performed.

It is important to consider jurisdictional requirements in this context. The Irish data protection regulator’s list under Article 35(4) of the GDPR also refers to invisible processing, but it does not require invisible processing to be combined with a high-risk factor before a DPIA becomes mandatory.

Therefore, for a web scraper, it is typically considered good practice to carry out a DPIA. A DPIA will assist an organization in demonstrating that the processing is fair, and the analysis of the ‘impossibility’ or ‘disproportionate effort’ exemption can be seamlessly incorporated into any DPIA.
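The UK and Irish positions described above can be summarised as a simple decision rule. The Python sketch below is illustrative only: the function name, the jurisdiction codes and the outcome labels are hypothetical, it encodes nothing beyond the statements in this section, and it is not a substitute for checking the regulators’ current lists.

```python
from enum import Enum


class DpiaOutcome(Enum):
    REQUIRED = "required"
    RECOMMENDED = "recommended"
    NOT_COVERED = "not covered by this sketch"


def dpia_outcome(jurisdiction: str, invisible: bool, high_risk: bool) -> DpiaOutcome:
    """Map the positions described in this section to an outcome.

    jurisdiction: "UK" or "IE" (hypothetical codes for this sketch).
    invisible: scraping without notifying data subjects.
    high_risk: at least one Article 29 Working Party high-risk indicator.
    """
    if not invisible:
        # This section only discusses invisible processing.
        return DpiaOutcome.NOT_COVERED
    if jurisdiction == "IE":
        # Irish list: no high-risk factor needed alongside invisible processing.
        return DpiaOutcome.REQUIRED
    if jurisdiction == "UK":
        # UK: mandatory when combined with a high-risk indicator,
        # otherwise the ICO's guidance is that one should still be done.
        return DpiaOutcome.REQUIRED if high_risk else DpiaOutcome.RECOMMENDED
    return DpiaOutcome.NOT_COVERED
```

On either branch the practical answer for a web scraper is the same: carry out the DPIA.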

Data minimization

Article 5 of the GDPR sets out the key data protection principles that organizations must adhere to in all their processing activities. One of the most obvious requirements for web scrapers is Article 5(1)(c) of the GDPR, which states that personal data must be ‘adequate, relevant and limited to what is necessary in relation to the purposes for which they are processed.’

Web scraping inherently involves large-scale data collection. As a result, personal data may be collected incidentally. Therefore, organizations must consistently ensure that any personal data collected through web scraping is directly applicable to the purpose for which it is collected. General assumptions that all potential personal data related to the purpose must be scraped will not align with this principle. Likewise, the unintentional collection of personal data would not be compliant if it serves no purpose and is simply gathered because the tool harvests everything on a webpage.
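One way a scraping pipeline might put this into practice is to discard everything that is not on a pre-agreed list of fields before anything is stored. The following Python sketch assumes a hypothetical price-comparison purpose; the field names and the `minimize_record` helper are illustrative assumptions, not a prescribed method.

```python
from typing import Any, Dict, FrozenSet, Iterable, List

# Hypothetical allow-list: the only fields judged necessary for the
# (assumed) purpose of comparing product prices.
NECESSARY_FIELDS: FrozenSet[str] = frozenset(
    {"product_name", "price", "currency", "retailer"}
)


def minimize_record(
    record: Dict[str, Any], allowed: FrozenSet[str] = NECESSARY_FIELDS
) -> Dict[str, Any]:
    """Keep only the allow-listed fields; everything else is dropped."""
    return {key: value for key, value in record.items() if key in allowed}


def minimize_all(
    records: Iterable[Dict[str, Any]], allowed: FrozenSet[str] = NECESSARY_FIELDS
) -> List[Dict[str, Any]]:
    """Apply the allow-list to every scraped record before storage."""
    return [minimize_record(record, allowed) for record in records]


# Example: a scraped page that incidentally contains a reviewer's details.
scraped = {
    "product_name": "Kettle",
    "price": 19.99,
    "currency": "GBP",
    "retailer": "Example Store",
    "reviewer_name": "Jane Doe",
    "reviewer_email": "jane@example.com",
}
stored = minimize_record(scraped)
```

An allow-list is used rather than a block-list so that any field nobody anticipated is excluded by default, which fits the point that data should not be kept simply because the tool harvests everything on a webpage.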

If you have any questions about web scraping, please feel free to contact Neil or Colin directly, or via EM Law’s website.
