The GDPR and GenAI — Part 1: Lawful Bases
First part of my legal analysis of the application of the GDPR to GenAI, and LLMs in particular
The recent official publication of the EU Artificial Intelligence Act has left several fundamental debates on the legal status of AI in Europe unresolved. Two significant potential legal obstacles to AI development and use remain inadequately addressed: intellectual property and personal data protection laws. My goal in this series of short publications is to focus on the latter, specifically examining the application of the EU General Data Protection Regulation (GDPR) to the development and use of large language models (LLMs). This topic is critically important, as certain proposed interpretations of the GDPR could severely restrict or even lead to a de facto prohibition of much AI activity within the EU. Moreover, the European Data Protection Board (EDPB) is currently drafting an opinion on the application of the GDPR to AI models, which is expected to be published in late December 2024.
In this series, I will explore three key issues:
Legal basis for data processing (the subject of this paper): the appropriate legal basis for processing personal data in LLM development and use under Article 6 of the GDPR.
Scope of GDPR application to LLMs: are there any situations in LLM development and use that do not involve the processing of personal data and thus fall outside the scope of GDPR; in particular, do model weights themselves contain personal data?
GDPR compliance for LLMs: how LLM developers and users can comply with specific GDPR requirements regarding data accuracy, rectification, and erasure.
These issues are likely to be covered in the upcoming EDPB opinion, as suggested by the questions posed by the EDPB during the stakeholder event on 5 November 2024, which I covered elsewhere.
Legal basis for data processing
Under the GDPR, any processing of personal data must be justified under one of the “lawful bases” defined in Article 6. Those include, among others, the consent of the person whose data is being processed (“the data subject”), contractual necessity, legal obligation, and legitimate interest. Given that “the processing of personal data” tends to be interpreted very broadly, there is a lively debate about which lawful bases could apply to each stage of AI development and use. In principle, different lawful bases could apply to different stages (e.g. data collection, model training, model deployment) and to different uses (e.g. to a public chatbot service and to a model used only for a legally required task like anti-money laundering).
For the model development stage, the legitimate interest basis seems to be preferred by both the authorities and the business community. Among the privacy authorities, the French CNIL, the Baden-Württemberg LfDI, and even the European Data Protection Board in its ChatGPT Taskforce Report, all suggested that the legitimate interest basis may be applicable, though not without a host of qualifications.
At the deployment stage, two kinds of personal data may be relevant. The first is the data of the users of a service, e.g., of an AI chatbot. Those users may input their own personal data while interacting with the service. If the service provider processes that data for its own purposes, it may, in some situations, be able to rely on contractual necessity, not only on legitimate interest. However, to the extent that personal data of third parties is involved, e.g. in model outputs delivered through the chat interface, similar limitations will likely apply as for training, and legitimate interest will likely be the most practical option.
Why not consent?
Consent under the GDPR is an opt-in mechanism that must be a “freely given, specific, informed and unambiguous indication of the data subject’s wishes.” Data subjects have a right to withdraw consent at any time in a way that is as easy as giving consent.
The French and Baden-Württemberg authorities rightly point out that it is most likely infeasible to ask for the informed, specific consent of all people whose personal data may be included in the training data for an LLM. LLMs are often trained using data scraped (crawled) from the Internet. This data is gathered by the model creators or sourced from providers like CommonCrawl. Even when LLM developers can identify instances of personal data in the dataset (e.g., by looking for common patterns like popular given names or surnames), they are unlikely to have the means to contact the people in question, or even to determine whether the information refers to real people rather than fictional characters. Moreover, there will almost inevitably be instances of personal data that cannot feasibly be identified automatically in training data. Some may argue that AI developers could adjust their data collection and training datasets in such a way as to make it easier to identify which data relates to which natural persons. But this would significantly increase the extent of personal data processing and the associated risks, interfering with the principle of data minimization.
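To illustrate why such identification is inherently limited, here is a minimal, purely hypothetical sketch of a pattern-based scan over scraped text. The name list, the regular expressions, and the function name are my own illustrative assumptions, not anyone's actual pipeline; real systems use far larger gazetteers and NER models, and still cannot tell whether a flagged name refers to a real, contactable person.

```python
# Hypothetical sketch: a naive pattern-based scan for potential personal data
# in scraped text. Illustrative only; not any developer's actual pipeline.
import re

ILLUSTRATIVE_GIVEN_NAMES = {"anna", "john", "maria", "pierre", "sofia"}
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def flag_potential_personal_data(snippet: str) -> list[str]:
    """Return rough labels for spans that *might* relate to an identifiable person."""
    flags = []
    if EMAIL_RE.search(snippet):
        flags.append("possible email address")
    for token in re.findall(r"[A-Z][a-z]+", snippet):
        if token.lower() in ILLUSTRATIVE_GIVEN_NAMES:
            # Note: this cannot distinguish a real person from a fictional
            # character, nor identify which person is meant or how to reach them.
            flags.append(f"possible given name: {token}")
    return flags

if __name__ == "__main__":
    print(flag_potential_personal_data("Maria wrote to john.doe@example.com about the new novel."))
```

Even in this toy example, a flag tells the developer nothing about whether "Maria" is a living individual, which of the many Marias is meant, or how to ask her for consent.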
What about the likely rare cases where the AI developer does know whose data they are processing and has the means to contact them? It may be tempting to say that, in such a case, the developer should rely on consent. The problem is that this would be practically infeasible for a different reason. As the Baden-Württemberg authority noted, one issue is the right of a data subject to withdraw consent at any time. The developer may be able to remove the individual's data from the training dataset, but what about a model already trained on that data? I want to flag here that there is a separate controversy over whether model weights store personal data at all, even when it is possible to prompt the model to output what appears to be personal data. I will return to this question in the next text in this series.
The second problem noted by the Baden-Württemberg authority is that consent is interpreted to come with a very high threshold of transparency about what the individual is consenting to. Depending on (controversial) GDPR interpretations, this threshold could be so high that it is unachievable for most people - they will simply be unable to understand how LLM development and deployment works (not just because mathematics is hard, but also because of the inherent opacity of LLMs even for experts).
Even where it is practically feasible for AI developers to ask for consent, it does not follow that they are legally required to rely on this GDPR basis. On a simplistic, “folk” version of the GDPR, the core of that law is that it requires consent to process personal data. This is incorrect. There is no basis for this claim in the GDPR. What the GDPR does require is that at least one of the six lawful bases from Article 6 must apply. Consent does not have primacy over other bases, including legitimate interest.
Legitimate interest
The French CNIL has suggested that “more often than not creating a training dataset whose use is lawful can be considered legitimate.” However, this general statement should be interpreted cautiously, as there are conditions for using the legitimate interest basis that could be applied stringently. The key conditions include that (1) the data processing is necessary for the legitimate interest and (2) the legitimate interest is not overridden by “the interests or fundamental rights and freedoms of the data subject.” Hence, to determine whether they can rely on legitimate interests, an AI developer or deployer needs to apply a balancing test between the legitimate interest and the rights and interests of those whose data would potentially be processed.
CNIL elaborates that “an analysis is necessary to determine whether the use of personal data for this purpose” disproportionately infringes on data subjects' privacy, “even when the data is not nominative.” To ensure proportionate processing, CNIL recommends data pseudonymization, exclusion of sensitive data, and defining selection criteria to limit collection to relevant and necessary data.
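For concreteness, below is a minimal sketch, under my own assumptions, of the kind of pseudonymization CNIL points to: detected identifiers are replaced with keyed, non-reversible tokens before the text enters a training corpus. The detection step, the key handling, and all names are illustrative, not a statement of what CNIL requires or of what any developer actually does; note also that pseudonymized data remains personal data under the GDPR, so the measure reduces risk rather than removing the processing from the GDPR's scope.

```python
# Hypothetical sketch of pseudonymization before training-data ingestion.
# Illustrative assumptions throughout; not a compliance recipe.
import hashlib
import hmac
import re

# Illustrative key; in practice it would be managed separately and never
# stored alongside the training data.
SECRET_KEY = b"illustrative-key-not-for-real-use"

def pseudonymize(text: str, patterns: list[re.Pattern]) -> str:
    """Replace matched identifiers with keyed, non-reversible tokens."""
    def replace(match: re.Match) -> str:
        digest = hmac.new(SECRET_KEY, match.group(0).encode(), hashlib.sha256).hexdigest()
        return f"<PSEUDONYM_{digest[:10]}>"
    for pattern in patterns:
        text = pattern.sub(replace, text)
    return text

if __name__ == "__main__":
    email_re = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
    print(pseudonymize("Contact maria.k@example.org for details.", [email_re]))
```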
When an organization relies on the legitimate interest basis, the GDPR provides concerned persons with a right to object to such processing of their data. This right is not absolute: the data controller may refuse to stop processing the objecting person’s data. One key reason for such refusal is when the data controller can demonstrate “compelling legitimate grounds for the processing which override the interests, rights and freedoms of the data subject.” In other words, when someone exercises their right to object, this triggers a second kind of rights/interests balancing test.
The legitimate interest basis was one of the main subjects of the EDPB stakeholder event on 5 November. As I described in my coverage of the event:
I emphasised that we must approach the question of balancing in the context of what legitimate interests controllers can rely upon. I stressed that the best interpretation of the GDPR would be one that fully aligns with the Charter of Fundamental Rights, taking into account not only privacy and data-protection rights, but also freedom of expression and information, among others. Drawing parallels with case law, I pointed to how the EU Court of Justice has approached internet search engines, both in Google Spain and more recent cases. Controllers should be able to rely not only on commercial interests, but also on considerations similar to those discussed by Advocate General Niilo Jääskinen in Google Spain regarding search engines—specifically, regarding how AI-based services facilitate Europeans’ freedom of expression and information. There is a compelling case that AI tools are not only already important for Europeans, but are likely to become even more pivotal than search engines. Any GDPR interpretation that fails to account for this would be incompatible with the charter.
Another facet discussed at the stakeholder event was the question of “first-party” and “third-party” data. We usually talk of “first-party” data when service providers have direct relationships with individuals whose data they want to use (e.g., for model training). As already mentioned, the GDPR does not require relying on consent for first-party data. Legitimate interest may not only be available but, in fact, may fit the situation of a direct relationship between developers and data subjects very well. Thanks to such a direct relationship, it may be easier to facilitate the data subject’s right to object (e.g., the data subject may have a web account that could be used to facilitate right-to-object requests). Moreover, this may make it easier to inform the data subjects about the processing and about their right to object.
As I noted in my coverage of the EDPB stakeholder event, some contributors
… suggested that, because it may be possible for AI developers to ask users for prior consent in first-party contexts, that should mean that those developers must rely on consent, not legitimate interest. But this seems pretty clearly an attempt to smuggle in the priority of consent over other legal bases in Article 6 GDPR without any grounds in the GDPR. One should also note that we are talking about situations where a data subject would be considered sufficiently protected under a legitimate-interest basis in a third-party context, but would suddenly need consent in an otherwise identical first-party situation. Among other problems, this appears to be an unprincipled departure from equality before the law (among third-party and first-party AI developers).
Sensitive data, consent, and search engines
A potential obstacle is that the legitimate interest basis does not apply to “special categories” of personal data (e.g., data on race, ethnicity, health, philosophical beliefs), which by default require explicit prior consent for processing. When processing large amounts of data, it may be impossible to avoid collecting information that could allow inferences about individuals’ special category data. An excessively strict application of GDPR rules on this issue could effectively require consent, conflicting with how competitive models are developed today.
CNIL acknowledges this challenge and suggests that if model developers take proper measures to minimize the risk of processing special category data, incidental and residual processing of sensitive data not intended for collection would not be considered illegal. However, it remains unclear what efforts will be required and whether authorities will resist setting an impractically high standard.
One analogy is particularly helpful here: consider large search engines. They rely on crawling the Internet for data and on first-party data about how their users interact with the search results. They process the data using advanced statistics, likely involving machine learning. Finally, they display search results without being able to vet all the information for lawfulness or for the presence of sensitive personal data, even though the leading search providers certainly aim to do that.
The key point for our discussion is that the EU Court of Justice decided not to apply EU privacy law in the strictest, or arguably even literal, way to search engines. This is what the Court’s Advocate General Szpunar wrote in his opinion, on which the Court implicitly relied in a later judgment - note that he refers to the GDPR’s predecessor (Directive 95/46):
Since Directive 95/46, which dates from 1995, and the obligations imposed in which are in principle addressed to the Member States, had not been drafted with search engines in their present form in mind, its provisions do not lend themselves to an intuitive and purely literal application to such search engines. (...)
It is therefore impossible to take an ‘all or nothing’ approach to the applicability of the provisions of Directive 95/46 to search engines. In my view, it is necessary to examine each provision from the aspect of whether it is capable of being applied to a search engine. (...)
A literal application of Article 8(1) of Directive 95/46 would require a search engine to ascertain that a list of results displayed following a search carried out on the basis of the name of a natural person does not contain any link to internet pages comprising data covered by that provision, and to do so ex ante and systematically, that is to say, even in the absence of a request for de-referencing from a data subject. (...)
To my mind, an ex ante systematic control is neither possible nor desirable.
As AG Szpunar advocated, in its 2019 decision in GC and Others v CNIL, the Court accepted that it is sufficient for search engine operators to de-reference personal data, including sensitive personal data, only after a request, in a kind of “notice and take down” model.
Should a similar line of reasoning apply to LLMs? Search engines are certainly more established today than LLMs, and more people rely on them to access information. It is, therefore, much easier to argue that search engines are protected by fundamental rights recognised by EU law, like the freedom of expression and information. Those rights rank above the GDPR, which means that the GDPR can restrict them only to the extent needed to achieve a balance among the legally protected rights (including the rights to privacy and to the protection of personal data). In other words, there can be no legally correct interpretation of the GDPR that does not fully account for fundamental rights other than privacy.
It may have been lucky for search engines that it took until 2014 for the EU Court to consider their compliance with EU privacy law in Google Spain. But imagine the courts asking questions about the role of search engines in exercising the freedom of information much earlier, e.g., in 1998. At that point in time, the technology was much less established, and it would have been much easier to dismiss it as a trivial novelty that simply “must comply” with the legal interpretations developed with older technologies in mind or otherwise be declared illegal. With the benefit of hindsight, I think most would find repugnant the idea that search engines could have been prohibited in their infancy.
Even though the use of generative AI is already broad and the technology is vital for many Europeans, both for expression and for organizing and accessing information, it is still very early. In search-engine terms, maybe we are in 1998, maybe even earlier.
It would be unprincipled and shortsighted to treat this technology as disposable instead of recognising not only its current role but also its clear potential to become a major tool for expression and access to information, as well as for scientific research.
It would also be a mistake to condition the legality of the technology on uncertain, potential developments. For instance, one possible avenue of LLM development is that we may end up with relatively small models that generalize better than the current largest frontier models. Such smaller models may exhibit very little perceived “memorisation,” but be excellent at reasoning or agentic behavior. Hence, they may be able to avoid some of the privacy issues that we are discussing now. However, the possibility of such a path of development is not a reason to prohibit the broad deployment of current technology, locking it in university laboratories. First, there is no guarantee that this will happen. Second, the only way to create such smaller models may be by first developing larger and larger models. It is also likely that such sufficiently large models can only be developed in commercial contexts, given not only the expense but also the needed experimentation and competition. Perhaps most importantly, even if we were certain that small frontier models would one day arrive, prohibiting wide, commercial deployment of LLMs would deny - perhaps for many years - the benefits in expression and in access to information that Europeans already enjoy, which would itself be a significant restriction of fundamental rights.