Introduction

One of our key clients is a leading data analytics company – we’ve been providing their team with legal advice since 2022. 

We’ve detailed in another case study the work we did to update the client’s existing “data licence” to accommodate changes to its business processes. 

One of the key aspects of that update was to ensure that the client’s data was protected from AI training by anyone other than the client itself. 

Context and challenge

It is common knowledge that AI models require a significant amount of data to operate effectively. This applies not only to the foundation models made available by OpenAI, xAI, Meta or Anthropic, but also to downstream tools operated by organisations that run their own “in-house” AI products built on specialised versions of these larger models. Any organisation can purchase or develop these specialised AI products for narrower purposes. 

These narrow tools still require data. The more data they have, the better they are. 

This applies to the “training” process – where an AI model is provided with information. The model uses that information to learn how to predict what should come next, even when it is later presented with new information by a new user. This training process can build an AI model from scratch. 
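
For readers who would like to see what this means in practice, below is a deliberately simplified sketch in Python of a model being trained from scratch. It is illustrative only (real models use neural networks and vastly more data), and the dataset shown is a hypothetical example of ours, but it demonstrates the key point: the training data is converted into the model’s own parameters.

```python
from collections import defaultdict

# A deliberately simplified "training from scratch" example: a bigram word
# model. The training text is converted into transition counts, and those
# counts are the model's parameters: the data is absorbed into the model.

def train(text):
    words = text.lower().split()
    model = defaultdict(lambda: defaultdict(int))
    for current, nxt in zip(words, words[1:]):
        model[current][nxt] += 1
    return model

def predict_next(model, word):
    followers = model.get(word.lower())
    if not followers:
        return None
    # Predict the word most often seen after `word` in the training data.
    return max(followers, key=followers.get)

# Hypothetical licensed dataset, for illustration only.
licensed_text = "widget sales rose sharply in march and fell sharply in april"

model = train(licensed_text)
print(predict_next(model, "sharply"))  # -> "in", a prediction learnt from the data
```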

Beyond that, AI models that have already been trained can be “fine-tuned” to further enhance their ability to make predictions in narrow scenarios. For example, a chatbot that has been fine-tuned on details of the products you have in stock will be better at communicating that to potential customers than a model that only has general information about what a business such as yours might have in stock. 

Training and, to a lesser extent, fine-tuning require that the dataset forms part of the model itself so that the AI can retain what it has learnt. Once the data is retained in this way, it can be used in a variety of different ways by the developer of the AI model, and precise control over that data is lost. A good example of this is how an AI model can be reverse engineered by a third party to reproduce the data that it has learnt from – a clear confidentiality concern. 
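
The confidentiality point can be illustrated with the same kind of toy model. In the hypothetical sketch below, once a short confidential sentence has been trained into the model, someone with nothing more than the ability to query the model can recover the sentence word by word. Real extraction techniques against large commercial models are far more sophisticated, but the underlying concern is the same.

```python
from collections import defaultdict

# Illustrative only: once confidential text has been trained into a model,
# someone with query access alone may be able to coax it back out verbatim.

def train(text):
    words = text.lower().split()
    model = defaultdict(lambda: defaultdict(int))
    for current, nxt in zip(words, words[1:]):
        model[current][nxt] += 1
    return model

def regurgitate(model, seed, max_words=20):
    out = [seed]
    for _ in range(max_words):
        followers = model.get(out[-1])
        if not followers:
            break
        out.append(max(followers, key=followers.get))  # most likely next word
    return " ".join(out)

# A hypothetical confidential sentence that has been trained into the model.
confidential = "project falcon launches in june at nine pounds per unit"
model = train(confidential)

# A third party who can only query the model can still recover the sentence.
print(regurgitate(model, "project"))
```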

Our client was concerned that by permitting (or, at least, being silent about) AI training or fine-tuning, it would lose control over its data. Customers could access its data, use an AI to learn from it, and that AI could then repeat what it has learnt to third parties. 

Process and insight

A key mechanism to ensure that the client’s data remains protected is simply to use contractual restrictions. If the client’s customers used the client’s data in a way that was not permitted by the data licence, the client would have a claim for breach of contract that could be enforced by the courts. 

We proceeded to draft the relevant restrictions. However, we did not want to be too heavy-handed – our client was well aware that its customers used AI tools to interrogate the data it licensed, and it thought it would be uncommercial to prevent that entirely (even though an outright prohibition would have been easy to draft). 

As such, we had to develop contractual clauses that would permit the “analysis” of our client’s data but not the “ingestion” of it by AI tools. The latter is a technical term for the process by which information is absorbed into the machine-learning components of an AI model. Once data has been ingested, it becomes part of the model itself. Mere “analysis”, however, does not require ingestion – AI systems can be constructed in such a way that they can use data without learning from it.
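
To make the drafting distinction concrete, here is a simplified sketch contrasting the two uses. The names ingest and analyse are our own shorthand rather than terms from any particular AI product: “ingestion” writes the licensed data into the model’s stored parameters, while “analysis” passes the data to the model only at the moment of the query and retains nothing afterwards.

```python
# Illustrative shorthand only: "ingest" and "analyse" are our own labels,
# not functions from any particular AI product or library.

class ToyModel:
    def __init__(self):
        self.parameters = {}  # whatever the model has permanently learnt

    def ingest(self, licensed_data):
        """Training or fine-tuning: the data is folded into the parameters
        and remains part of the model after this call returns."""
        for record in licensed_data:
            self.parameters[record] = self.parameters.get(record, 0) + 1

    def analyse(self, licensed_data, question):
        """Inference-only use: the data is supplied as ephemeral context for
        this one query and nothing is written back to the parameters."""
        context = "\n".join(licensed_data)
        return f"Answer to {question!r} based on {len(context)} characters of context"

model = ToyModel()
data = ["2024 Q1 widget sales: 1,204 units", "2024 Q2 widget sales: 1,377 units"]

print(model.analyse(data, "How did Q2 compare with Q1?"))
print(model.parameters)  # still empty: nothing was retained

model.ingest(data)
print(model.parameters)  # the licensed data now lives inside the model itself
```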

As part of this, it was important to cater for AI techniques that arguably blur this distinction. For example, “retrieval-augmented generation” techniques do go slightly further than what would traditionally be considered analysis, because the licensed data is indexed and stored so that relevant extracts can be retrieved and supplied to the model at query time. But these retrieval-augmented generation techniques do not necessarily require the permanent retention of data. There are also other methods of training (such as federated learning, where a core model is trained across separate local instances) that do not necessarily require access to the underlying raw data and cannot readily be reverse engineered. 
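
Retrieval-augmented generation sits between the two, as the simplified sketch below illustrates. The licensed data is held in a separate retrieval store and relevant records are handed to the model at the moment of each query; in this arrangement the model’s own parameters are never updated, and the store can be purged when the licence ends. The keyword search shown is a stand-in for the vector search a production system would typically use.

```python
# A minimal retrieval-augmented generation (RAG) sketch. Keyword matching
# stands in for the vector search a production system would normally use.

class RetrievalStore:
    def __init__(self):
        self.documents = []

    def add(self, doc):
        self.documents.append(doc)  # data sits in the store, not in the model

    def search(self, query, k=2):
        terms = set(query.lower().split())
        scored = sorted(self.documents,
                        key=lambda d: len(terms & set(d.lower().split())),
                        reverse=True)
        return scored[:k]

    def purge(self):
        self.documents.clear()  # deletable when the licence comes to an end

def generate_answer(question, context):
    # Stand-in for a call to a language model whose weights are never updated;
    # the retrieved records are only ever supplied as part of the prompt.
    return f"Model answer to {question!r}, drawing on: {context}"

store = RetrievalStore()
store.add("March 2024: average widget price rose to 9.40 GBP")
store.add("April 2024: average widget price fell to 8.90 GBP")
store.add("Staff handbook: the office closes at 17:30 on Fridays")

print(generate_answer("What happened to widget prices in April?",
                      store.search("widget price April")))

store.purge()  # the licensed data is gone; nothing was baked into model weights
```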

Result

We amended the client’s documentation to accommodate these various points and to cater for a wide range of scenarios. Despite the technical complexity, most of the drafting is straightforward, and most businesses, including our client, can readily rely on the distinction between ingestion and analysis.  

Implementing the necessary contractual restrictions, now and in the future, will be a key part of protecting data-heavy organisations from the (mis)use of AI systems.

If you are collecting or receiving a significant amount of data and you have questions around how best to protect yourself, please do not hesitate to contact us.