AI and EU privacy law: June 2024 state of play
Despite the official publication of the EU Artificial Intelligence Act, debates on the legal status of AI in Europe continue. Two significant potential legal obstacles for AI development and use remain inadequately addressed: intellectual property and privacy laws. Here, I focus on the current debates regarding the application of EU privacy law to AI, an issue of critical importance as certain proposed interpretations could severely restrict or even prohibit much AI activity within the EU.
TLDR overview:
- GDPR compliance challenges for AI model training:
  - "Legitimate interest" as only practical legal basis
  - Uncertainty in application, especially for special category data
  - French CNIL's guidance attempts to accommodate model training
- Open vs. closed AI models:
  - Distinction has crucial implications for GDPR application
  - Some interpretations may impede EU AI development, especially open research
- AI models and personal data:
  - Conflicting views from data protection authorities
  - Hamburg DPA: large language models don't store personal data
  - CNIL's guidance may allow similar conclusions for widely used models
- GDPR requirements for AI models and outputs:
  - Challenges in applying data accuracy, rectification, and erasure
  - NOYB's complaint against OpenAI in Austria
  - Strict interpretations could effectively prohibit much AI development in EU
  - Potential impact on open research and less-resourced developers
Model training and personal data
Large language models are often trained using data scraped from the Internet. This data is either gathered by the model creators or sourced from providers like CommonCrawl. The inclusion of personal data in such datasets has raised questions regarding compliance with the EU General Data Protection Regulation (GDPR). The primary concern is whether processing personal data in this context can be justified under the "legitimate interest" legal basis.
While model developers with direct relationships to data subjects could theoretically rely on other legal bases, such as contractual necessity or prior consent, these options are impractical in most cases. Consequently, legitimate interest remains the only viable option for most large-scale data processing.
Some positive signals have emerged on this front. For instance, the French data protection authority (DPA), CNIL, has suggested that creating a lawful training dataset can often be considered legitimate. However, this general statement should be interpreted cautiously, as there are conditions for using the legitimate interest basis that could be applied stringently.
CNIL further elaborates that “an analysis is necessary to determine whether the use of personal data for this purpose” disproportionately infringes on data subjects' privacy, “even when the data is not nominative.” To ensure proportionate processing, CNIL recommends measures such as data pseudonymization, exclusion of sensitive data, and defining selection criteria to limit collection to relevant and necessary data.
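To make these recommendations more concrete, below is a minimal sketch of what such a pre-training filtering step could look like in Python. The regular expressions, keyword list, and function names are my own illustrative simplifications, not anything CNIL prescribes; a real pipeline would rely on far more robust PII detection and carefully curated selection criteria.

```python
import re

# Hypothetical, simplified pre-training filter in the spirit of CNIL's
# recommendations: pseudonymise direct identifiers and drop records that
# match crude patterns for sensitive topics. A real pipeline would use far
# more robust PII detection (e.g. NER models) and carefully curated criteria.

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

# Illustrative placeholders only, not a taxonomy of GDPR special category data.
SENSITIVE_KEYWORDS = {"diagnosis", "religion", "ethnicity", "trade union"}


def pseudonymise(text: str) -> str:
    """Replace direct identifiers with neutral placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text


def keep_record(text: str) -> bool:
    """Exclude records that look like they contain special category data."""
    lowered = text.lower()
    return not any(keyword in lowered for keyword in SENSITIVE_KEYWORDS)


def build_training_corpus(raw_records: list[str]) -> list[str]:
    """Apply selection criteria first, then pseudonymise what is kept."""
    return [pseudonymise(record) for record in raw_records if keep_record(record)]


raw = [
    "Contact me at jane.doe@example.com or +33 1 23 45 67 89.",
    "The patient's diagnosis was discussed at the meeting.",
    "Open-source licences differ in their copyleft obligations.",
]
print(build_training_corpus(raw))
```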
A potential obstacle is that the legitimate interest basis does not apply to "special categories" of personal data (e.g., data on race, ethnicity, health, or philosophical beliefs), which by default require prior consent for processing. When processing large amounts of data, it may be impossible to avoid collecting information that could allow inferences about individuals' special category data. An excessively strict application of the GDPR rules on this issue could effectively require consent, prohibiting the way many models are developed today.
CNIL acknowledges this challenge and suggests that if model developers take proper measures to minimize the risk of processing special category data, incidental and residual processing of sensitive data not intended for collection would not be considered illegal. However, it remains unclear what efforts will be required and whether authorities will resist the temptation to set an impractically high standard.
How does the GDPR affect open and closed AI?
The development process for most new models created from scratch (rather than by modifying or fine-tuning existing models) follows a similar pattern to the one described above. However, a significant distinction arises at the model storage and distribution stage. This distinction is between "open weight" models (sometimes called "open source" models) and those available only as a service (through APIs or user interfaces like ChatGPT). There is considerable debate about how this distinction affects the application of the GDPR to models at the stage when they are merely being stored or distributed as a collection of parameters (weights).
Broad GDPR application to model weights would hamper open development
If a model is considered to contain personal data as defined by the GDPR, then simply storing or sharing the model (even just uploading the model weights to platforms like HuggingFace) constitutes personal data processing under the GDPR. This may require those who store such models to (see CNIL’s summary):
inform data subjects (the individuals concerned) about the processing of their data;
allow data subjects to exercise their rights, including the rights to rectification and erasure of personal data (more on this below);
ensure that data transfers outside the EU comply with GDPR.
The exceptions to these and other obligations (e.g., for "purely personal or household activity") are narrow, and much of the development and experimentation in the EU would not qualify for them. If even popular open-weight models with low risks of personal data regurgitation and re-identification are considered to contain personal data, this could significantly impede AI development and use in the EU.
Despite some transparency benefits relevant to GDPR compliance (e.g., those noted by CNIL here), broadly applying the GDPR to model weights would make it difficult for the open development model to continue and would advantage closed AI applications.
Notably, the German Conference of Data Protection Authorities (DSK) expressed scepticism towards what they defined as "open" AI applications in their orientation guide. The DSK's definition of "open" appears to include not only open-weight models but also many models offered as public services. The DSK concluded that "technically closed systems are preferable from a data protection perspective." While this doesn't necessarily mean the DSK considers other models unlawful, it highlights the risk that privacy law could hinder open AI research and development.
Do models contain personal data?
The Hamburg DPA proposed the most well-reasoned interpretation in a discussion paper, concluding that large language models do not store personal data as defined by the GDPR. This interpretation suggests that the GDPR does not apply to these models, allowing developers of open-weight models to distribute them without GDPR restrictions. Furthermore, this interpretation implies that AI developers are exempt from GDPR obligations regarding the mere storage of these models. Consequently, individuals cannot request the "rectification" of "inaccurate" data about themselves in a stored model. The Hamburg DPA reasons that
individual tokens as language fragments ("M", "ia", " Mü" and "ller") lack individual information content and do not function as placeholders for such. Even the embeddings, which represent relationships between these tokens, are merely mathematical representations of the trained input. (...) In LLMs, the stored information already lacks the necessary direct, targeted association to individuals that characterizes personal data in CJEU jurisprudence: the information "relating" to a natural person.
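The tokenisation point is straightforward to observe in practice. The sketch below, which assumes the Hugging Face transformers library and the GPT-2 tokenizer (the Hamburg paper does not reference any specific tokenizer), shows how a name is split into subword fragments and mapped to the integer IDs that a model actually processes; the exact fragments depend on the tokenizer's learned vocabulary.

```python
from transformers import AutoTokenizer

# Assumed setup for illustration only: the GPT-2 tokenizer from Hugging Face.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

name = "Maria Müller"
tokens = tokenizer.tokenize(name)               # subword fragments of the name
ids = tokenizer.convert_tokens_to_ids(tokens)   # the integers the model actually sees

print(tokens)  # the exact split depends on the tokenizer's vocabulary
print(ids)     # inside the model, only these IDs and their embeddings exist
```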
Documents from the French CNIL appear to suggest that they may not share the view that AI models do not contain personal data, referring to models as potentially containing "memorised" personal data (e.g. here and here). However, if this is indeed CNIL's considered position, their justification appears legally less convincing than that of the Hamburg authority, at least concerning the most advanced and widely used models.
CNIL emphasises the "risks of regurgitation and extraction of personal data" from a model, which the Hamburg DPA also considers. The Hamburg authority, however, does not view this risk as grounds for considering models as containing personal data for three reasons:
such attacks would likely be impossible without access to training data (currently, the most performant and widely used models do not share training data);
these attacks would require efforts disproportionate to the potential benefits an attacker might hope to derive;
such attacks would be illegal.
It's important to note that this counterargument applies specifically to how the best general-purpose models are currently distributed. To dismiss this counterargument, one would need to adopt an extreme interpretation of the GDPR, according to which data is personal (and within the GDPR's scope) whenever there is even the slightest theoretical possibility of relating it to an individual. This interpretation would contradict the case law of the EU Court of Justice.
Some remarks in CNIL documents may suggest a tendency towards this legally incorrect interpretation, such as when they explicitly discuss formal guarantees of "zero" risk of re-identification. However, in the same document, CNIL also discusses the risk of re-identification under the correct legal standard.
CNIL correctly points out that some models—not the most advanced ones in widespread use—may suffer from significant overfitting, potentially resulting in regurgitating training data without disproportionate efforts. This could mean that those particular models contain personal data, and their storage and distribution would be processing activities covered by the GDPR.
The apparent difference between CNIL and Hamburg documents might best be understood as a difference in emphasis. Hamburg focuses on the primary examples of advanced models, while CNIL attempts to cover edge cases involving less robust model development efforts.
Accuracy, rectification, and erasure
The GDPR, when applicable, requires data processing to adhere to the principle of accuracy. It also mandates that data subjects have the right to rectify inaccurate data and request data erasure. A specific question arises regarding current AI technology: how should these requirements apply to the models themselves and to the generation of outputs from these models (inference)? This question is crucial, as certain interpretations of these GDPR requirements could potentially lead to a de facto prohibition on much of AI development and use within the European Union.
NOYB v OpenAI complaint
The complaint filed by NOYB against OpenAI with the Austrian DPA illustrates the challenges of data accuracy and erasure requirements. When asked about a specific public figure, ChatGPT consistently provided an incorrect birthdate. OpenAI responded that it could not selectively correct or erase this information without blocking all data about the individual. The company argued that such a comprehensive block would infringe on its freedom to inform the public about a public figure and the public's right to be informed, both—we should note—protected by the EU Charter of Fundamental Rights.
NOYB contends that this inability or unwillingness to rectify inaccurate personal data breaches the GDPR's accuracy principle. Furthermore, NOYB argues that OpenAI's refusal to erase the incorrect birthdate violates the data subject's rights to erasure (Article 17 GDPR) and rectification (Article 16 GDPR). The complaint emphasizes that technical limitations of an AI system do not justify non-compliance with the GDPR, asserting that if a controller develops software incapable of complying with data protection law, the processing is simply unlawful.
When OpenAI receives a request for erasure or rectification, it can use filters to block the display of personal data for individuals who request it. OpenAI likely maintains that re-training or fine-tuning their models to remove inferences related to specific individuals would be disproportionately costly or ineffective. This is why they use filters, which presumably either modify the user's input prompt or the model's output, aiming to prevent the disclosure of personal data.
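A minimal sketch of what such an output-side filter could look like is below. The blocklist, refusal message, and function names are hypothetical illustrations; OpenAI has not published how its filters actually work.

```python
# Hypothetical output-side filter: before a generated answer is shown, it is
# checked against a registry of data subjects who requested erasure or
# rectification. Purely illustrative; not any provider's actual implementation.

ERASURE_REQUESTS = {"Jane Doe"}  # hypothetical registry of erasure requests

REFUSAL_MESSAGE = "I can't share information about this person."


def filter_output(generated_text: str) -> str:
    """Suppress model output that mentions a person who requested erasure."""
    for name in ERASURE_REQUESTS:
        if name.lower() in generated_text.lower():
            return REFUSAL_MESSAGE
    return generated_text


print(filter_output("Jane Doe was born on 1 January 1970."))     # refusal message
print(filter_output("The GDPR became applicable in May 2018."))  # passed through unchanged
```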
Excessively stringent application of the requirements could de facto prohibit AI development and use
Accepting that requests for rectification or erasure require model developers or deployers to make very costly efforts (e.g., retraining entire models) could create a barrier for all but the best-resourced developers, and perhaps even for them.
A more fundamental question is whether even retraining could satisfy a request for erasure if the threshold for what counts as an output relating to an individual is set very low. For example, the output NOYB complained about, a model's answer when asked for someone's date of birth, likely reflects a "statistical inference from personal data provided in the prompt" (to use CNIL's phrasing) rather than personal data processed by the model in any meaningful sense.
Some authorities have not yet considered this issue in depth. The German DSK's orientation guide suggested that the GDPR rights to rectification of inaccurate data and to erasure may not be satisfied at the output generation stage, which is where OpenAI addresses them. Instead, the DSK proposed that rectification or erasure should be implemented by modifying the models themselves, explicitly referring to retraining and fine-tuning as potential solutions. They added that "filter technologies can help avoid certain outputs and thus serve the rights and freedoms of the persons affected by a certain output."
CNIL's preliminary approach is more nuanced. They explicitly consider the possibility that some data in model outputs relating to individuals constitutes "statistical inference from personal data provided in the prompt." In their view, "the processing of such data will be the responsibility of the user of the system" not of the developer or service provider.
However, CNIL insists that some models may contain personal data, and in such cases, they suggest retraining "should be considered whenever it is not disproportionate to the rights of the controller, in particular the freedom to conduct a business." While the reference to proportionality is welcome and appropriate, it is somewhat muddled by the following statement: "In practice, this will essentially depend on the sensitivity of the data and the risks that their regurgitation or disclosure would pose to individuals." This is puzzling given the omission of one of the most significant concerns: the cost of retraining.
CNIL notes that "retraining may take place periodically to integrate several requests" but refers to the GDPR deadline of three months. Estimates for a single training run for large models like Gemini 1.0 Ultra and GPT-4 are in the range of tens of millions of dollars (e.g. Heim, Epoch AI). An additional consideration for proportionality assessment is the environmental impact of energy use, especially if retraining is meant to comply solely with GDPR requests without other significant benefits.
This raises additional concerns about how imposing significant costs would affect less well-resourced developers and open model development. Would developers who provide open weights without monetizing them also be required to keep retraining models? CNIL appears to expect the developer of the base model to retrain it continuously: it states that developers should be "contractually requiring" users "to use only a regularly updated version" (likely implying that CNIL has in mind full retraining rather than fine-tuning, which users could do themselves).
This demonstrates the importance of determining whether models themselves are to be considered as containing personal data under the GDPR. If they aren't—as suggested by the Hamburg DPA—then the rights to rectification and erasure, as well as the principle of accuracy, don't apply to them.
As the Hamburg authority noted, those requirements may then apply to other stages like prompt and output processing (allowing for OpenAI-style filtering), Retrieval Augmented Generation (RAG) modules, or internet search modules. Application of the GDPR to those stages doesn't pose the same significant problems as its application to the models themselves.
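To illustrate why those stages are easier to handle, below is a hedged sketch of an erasure request applied to a simple retrieval store of the kind a RAG module might consult. The store, document identifiers, and method names are hypothetical; the point is only that deleting a person's documents from an index is an ordinary data operation, with no retraining of model weights involved.

```python
# Hypothetical illustration of erasure at the RAG stage: the model's weights
# stay untouched; the request is honoured by deleting the data subject's
# documents from the store that the system consults before answering.

from dataclasses import dataclass, field


@dataclass
class RetrievalStore:
    documents: dict[str, str] = field(default_factory=dict)  # doc_id -> text

    def add(self, doc_id: str, text: str) -> None:
        self.documents[doc_id] = text

    def erase_subject(self, name: str) -> int:
        """Delete every document mentioning the data subject; return how many were removed."""
        to_delete = [d for d, text in self.documents.items() if name.lower() in text.lower()]
        for doc_id in to_delete:
            del self.documents[doc_id]
        return len(to_delete)

    def retrieve(self, query: str) -> list[str]:
        """Naive keyword lookup standing in for a vector search."""
        return [text for text in self.documents.values() if query.lower() in text.lower()]


store = RetrievalStore()
store.add("doc1", "Jane Doe was born on 1 January 1970.")
store.add("doc2", "The GDPR applies to the processing of personal data.")

store.erase_subject("Jane Doe")    # honours the erasure request
print(store.retrieve("Jane Doe"))  # prints [] because nothing about her remains in the index
```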