Data scraping for AI development

Machine learning and AI have been used in business for well over a decade, and adoption of generative AI is now ubiquitous. As AI use and development reaches its teens, we should not be surprised that it expects freedom and wants to go in its own direction (in truth it probably already has), and that in doing so it is leaving a trail of difficult issues for its creators to address. With increased understanding of what AI is becoming and increasing awareness of how it can operate, I have found my clients being asked, “Where does your training data come from?”, “Is it subject to copyright?”, “Was its collection permissioned?”, “How do you know it’s not biased?”. Inevitably, once the answers are provided and reviewed, the cognisant customer will say, “We need you to warrant your replies”, and this is an area where many software providers stumble.

Developers will usually have collected their data using third-party scraping; images used for training may form part of huge data sets acquired from data collection agencies that scraped them from social media sites like Instagram, Meta/Facebook, Twitter/X and TikTok. Some of that data scraping and training may have occurred before the GDPR and the Data Protection Act 2018, and the source may not be as “clean” or “permissioned” as regulation and law oblige and as Designated Authorities increasingly require.

Several weighty and useful articles have been produced by various large firms on the issue of data scraping and data protection. This piece looks at the issues from the client side, both in terms of the questions for clients to ask AI technology providers and the steps technology businesses need to take to address them. It references applicable law and makes suggestions on how you can help your tech-provider and tech-procuring clients navigate the issues of using scraped data and AI developed from it.

Meta, data

Mid-way through 2024, Meta introduced new business terms granting itself permission to use public and non-public user data collected since 2007 in order to train its AI models, definitively closing the open question of what Meta does with our data (spoiler alert: Meta uses our data to train its systems).

Acknowledging that this is what it has always done, Meta’s announcement pegged the processing to “legitimate interests” under the GDPR (and the UK DPA 2018), saying “we have legitimate interests in processing data to build these services”.

Relying on the “legitimate interests” basis under Art 6.1(f) of the GDPR requires Meta to consider three key criteria and to satisfy all three before processing. The UK ICO explains the test as follows:

  1. is there a legitimate interest behind the processing;
  2. is the processing necessary for that purpose, and is the data adequate, relevant and limited to only what is necessary; and
  3. is the legitimate interest overridden by the interests, rights and freedoms of the individual, balancing the data subjects’ rights against the legitimate interest in processing.

All three elements must be considered and satisfied for the exception to apply and for a business (such as Meta) to be able to claim it has legitimate interests in processing data. 

Does the fact that the data is needed to feed the voracious appetite of the AI behind Meta’s services mean scraping data is justified and thus “legitimate”? At its core, Meta is saying that if you want the service, you have to participate in developing it. Is this the unwritten compact we make with social media when we post, and does it provide legitimacy, adequacy and relevance to this action? Various consumer groups and tech watchdogs have queried the legitimacy of Meta’s “legitimate interests” claim, and it is facing privacy complaints in 11 EU member states. It is a valid concern.

In its press release about the change in its terms, Meta went on to say “people can object [to data scraping] using a form found in our Privacy Centre if they wish”, but it is questionable whether burying the ‘opt out’ deep in its Privacy Terms will enable Meta to argue that users can withdraw consent and balance their own interests against Meta’s. Certainly the UK ICO considers data scraping for generative AI development to be a high-risk activity. Where “invisible processing activity” occurs and people are unaware of the processing, the ICO cautions they may lose control over how and which organisations process their personal data, and so become unable to exercise their legal rights to protect it.

Meta noted that its approach is “consistent with how other tech companies are developing and improving their AI experiences in Europe”, in effect claiming that everyone is doing it so it must be OK. Of course, most developers and the tech industry already knew that big data is collected and used for training new programs, but the fact that changes such as this make headlines does indicate that this type of use of data (which includes data from dormant Facebook accounts, linked information and the sharing of data scraped from online sources with third parties) surprises and concerns many.

X has taken a characteristically contrary approach to coming out about data scraping. In April 2023, Elon Musk threatened to sue Microsoft over the use of data scraped from X to train OpenAI’s models (OpenAI being a company on whose board Musk had previously served!). Potentially this was a tit-for-tat volley returned after Microsoft announced it would drop X from its advertising, but the fact that it made headlines hinted that the ante had been upped in the data-scraping stakes.

In summer 2023, X launched X.AI, stating it would only use publicly available data for training its AI. What is ‘publicly available’ in X’s estimation is not clear, but X seemed to continue to reach for the moral high ground and changed its business terms late in 2023 to prevent data scraping without prior consent, with limits on the amount of data that can be scraped and crawling permitted only in accordance with X’s robots.txt file, which disallows all other robot crawlers. X’s robots.txt does, however, allow Google’s crawlers in, on the back of a deal X has done with Google to drive traffic to X. The Google deal allows Google to crawl logged-out and inactive accounts as well as active ones, giving it legacy data as well as current. X’s new terms also permit it to collect biometric data, which seems the complete opposite end of the spectrum from the “public data” it claimed would be used to train X.AI. This type of data scraping is open to challenge and must be permissioned; accepting X’s terms provides that permission.
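In practical terms, crawler permissions of this kind are signalled through a site’s robots.txt file, which compliant scrapers are expected to consult before collecting anything. The sketch below, written in Python purely for illustration, shows how a scraper might check whether a named crawler is permitted to fetch a page; the domain and user-agent string are hypothetical placeholders, not X’s actual configuration.

  # Minimal sketch: consulting a site's robots.txt before scraping.
  # The domain and user-agent below are illustrative placeholders only.
  from urllib.robotparser import RobotFileParser

  robots = RobotFileParser()
  robots.set_url("https://example.com/robots.txt")
  robots.read()  # fetch and parse the robots.txt file

  target_url = "https://example.com/some/public/page"
  crawler_name = "ExampleResearchBot"  # hypothetical crawler user-agent

  if robots.can_fetch(crawler_name, target_url):
      print(f"{crawler_name} may fetch {target_url}")
  else:
      print(f"{crawler_name} is disallowed from fetching {target_url}")

A robots.txt file is a request rather than a technical barrier, which is precisely why X has backed it up with contractual terms.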

Whilst Meta has chosen not to include private messages in its training data, which may indicate it has balanced the rights and freedoms of individuals with its legitimate interests, this simply confirms the scope that Meta has to read, research and use those messages, should it choose to.

As for the second part of the test, given the enormity of the data sets, it must be moot whether “only what is relevant and necessary” is collected. Perhaps Meta (and its peers) take the view that to be truly “artificial intelligence” (and possibly also to pass the Turing Test) its AI needs to know everything about being human, from our love of cat gifs to our peccadillos, our tempers and tantrums, grief and hysteria. Meta alone holds vast quantities of data (2.45 billion items are shared daily on Facebook alone by its billions of users), and whether these are structured or unstructured, anonymised or identifiable, the question remains: does any business truly have a legitimate interest in using “our” data without specific, informed consent?

Meta’s massive generalisation that, because the scraping is done using, and for the benefit of, AI technology (a non-specific term covering a vast array of technologies), it is legitimate, and that it is done by just about every business developing AI (which is true), is almost certainly not sustainable. So why make this claim knowing it will likely be challenged? The new EU Artificial Intelligence Act (Regulation (EU) 2024/1689) seeks to harmonise the development, placing on the market, putting into service and use of artificial intelligence systems in the EU, and it creates greater obligations on technology providers to seek certification for their AI products. It could be that Meta (and others) are changing their terms as a way of making declarations about the “geographical, behavioural, contextual or functional setting within which they are intended to be used” (Art 42.1) and thereby looking to shoehorn their tech into the “presumed compliant” category and avoid further interrogation and oversight.

Irrespective of the motivation for these changes (if they are not as stated, and we can only speculate about that), data scraping will continue because it is the most accessible way of securing data to train the AI models we envisage using in our future. As more data and better scraping (arguably) lead to improved technology, what can be done from a practical perspective to protect users and to help businesses give their customers the type of assurances they require?

Know your source

Developing generative AI requires developers to collect (and in some cases to collate) and to process scraped data. That data is then used to train the base AI model, which is revised, recalibrated and ultimately deployed (usually in a beta format), and the model is then improved based on feedback.

Most developers of generative AI rely on publicly accessible sources for their training data, either engaging agencies to provide the data or, in the case of larger businesses, scraping it themselves. Businesses procuring scraped data from data suppliers need to ensure those suppliers have been granted permission to collect it. Developers using such agencies need to ask:

  • Was this data collected directly through web scraping, or indirectly from another organisation that has web-scraped data?
  • Do you have express permission to scrape this data?
  • Who from?
  • When was that permission granted?
  • Show us that permission (or the exception that your data scraping falls under).
  • Does your contract permit you to license or sell that data commercially?
  • Does the party you scraped from have permission to collect personal data?
  • Do the supplier’s Ts & Cs include a warranty about permission and an indemnity against claims for unauthorised use?

Developers scraping data in-house to train their AI need to:

  • ensure they know and record each data source in a data log (a minimal sketch of such a log follows this list);
  • check with the business whose data they scrape that they have permission to do this and to use that data for AI training;
  • ask the businesses whose data is scraped to confirm they have permission from their users to supply data for scraping and training;
  • check the Ts & Cs of the business whose data is used to ensure there is a warranty about that data and an indemnity against claims for unauthorised use.
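A data log need not be elaborate. The sketch below, in Python with hypothetical field names, illustrates the sort of provenance record a developer might keep for each scraped source so that its origin and permissions can be evidenced later.

  # Illustrative sketch of a training-data provenance record.
  # Field names and values are hypothetical, not a prescribed standard.
  from dataclasses import dataclass, asdict
  from datetime import date
  import json

  @dataclass
  class ScrapedSourceRecord:
      source_url: str             # where the data was scraped from
      date_collected: date        # when it was collected
      permission_basis: str       # e.g. licence, consent, contractual term
      permission_evidence: str    # reference to the contract, licence or Ts & Cs
      contains_personal_data: bool
      intended_purpose: str       # the specific model or purpose it will train

  record = ScrapedSourceRecord(
      source_url="https://example.com/dataset",
      date_collected=date(2024, 6, 1),
      permission_basis="Licence from site operator",
      permission_evidence="Supplier agreement ref ABC-123 (hypothetical)",
      contains_personal_data=True,
      intended_purpose="Training image-classification model v2",
  )

  # Keep each record so the provenance trail can be evidenced on request.
  print(json.dumps(asdict(record), default=str, indent=2))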

All developers also need to ensure that data they process, whether that is images, text, videos or other information, complies with data protection legislation. 

Biometric data also has to be specifically permissioned under the GDPR and must comply with BIPA, the Data Privacy Bill, the EU AI Act, POPIA and their equivalents.

Permissioned data, legitimate use

As noted above, the test for what is considered “legitimate use” is three-fold and, as part of complying with the legitimate use exception, developers need to ensure their processing is legitimate, is necessary and does not override the rights of individuals.

Is it legitimate? A developer (and a client procuring development of anything using AI) must ensure that:

  • the AI output is not in breach of any laws (not just the GDPR, BIPA or the EU AIA); and
  • there is a lawful basis for the development.

The bar for entry is high, as it should be. To qualify for the first part of this test, the developer needs to:

  • ensure the data is not only permissioned but also not subject to any restrictions such as copyright or trade mark rights;
  • ensure, or ask for a warranty, that the data is not confidential, a trade secret, subject to sanction or in breach of any discrimination law.

The second part of element one of the test is the lawful basis assessment. The developer’s “interest” could be developing AI for its own use or for commercial use, either on its own platform or by bringing it to market for third parties to procure. The ICO is very clear on the lawful basis point, though: a developer must be able to evidence the model’s specific purpose and ensure its downstream use will respect data protection and data subjects’ rights and freedoms, so data scraping for speculative development will never pass this test unless strict controls and monitoring are deployed in the model that is built.

Developers need to ask their suppliers, and to be able to answer themselves when asked, the following:

  • Is there a legitimate reason to scrape this data and use it?
  • Have all laws been complied with in relation to collection of this data?
  • Can we verify that copyright, patents, trade marks, biometric law and data protection have not been infringed, and can we have a warranty and indemnity in respect of your assurances?
  • Do we have a clear and reasonable application for use of the data that is scraped?

Is it necessary? This is a factual test. As most AI development requires the big data that data scraping facilitates, the use of the data set is likely to be necessary, provided that data set fulfils the purpose identified.

Developers need to show and to be able to justify if asked:

  • that it is only feasible to train the model using a large data set scraped in this way;
  • that there is a justification for using scraped data and a known output from it.

Finally, do individuals’ rights override the interest of the developer? If there is a legitimate purpose and data scraping is necessary for that, then the final consideration in the three-fold test for developers to review is whether the interests, rights and freedoms of individual data subjects override its defined purpose and its need for the scraped data.

The “upstream risk” of even one individual losing control of their data should not be ignored, but it almost certainly falls under Meta’s “everyone does it” argument, and when few really know how their data is actually used, where it is stored and by whom, this is surely a crooked yardstick against which to make an assessment. Downstream risks to individuals include use of the developed AI to create inaccurate information, reputational harm and social harm at the individual, community and country level. Surely, if an AI program is created for these purposes or can be deployed for them, the programming is at fault and not the data used to train it, and the likelihood of the threefold test being correctly or fairly assessed and applied is minimal anyway? Again, the test is flawed. Be that as it may, in considering the downstream effects, a developer who is complying with their legal obligations must:

  • control and evidence whether the generative AI model is actually used for the stated purpose;
  • assess and record risks to individuals during development and post-deployment; and
  • take steps to mitigate risks to individuals.

Where a developer provides an API which facilitates development by others, it can be specific about how the API is used. This can be done using licensing terms and contracts (of course) and by reserving audit and monitoring capabilities, and developers should attempt to do this, though the practicalities of knowing how APIs will be used present their own challenges, including risks of security breaches and breaches of confidential information.

Closing the loop

Developers need to make sure their own development loop is closed to scraping and that data they create and information they process for customers cannot be used to train other AI. On a simple level, if (for example) a developer creates an app that uses facial recognition, it needs to be able to identify where the training data that taught the app to recognise faces came from, or, if it used widgets to facilitate app development, the data used and permissions granted to train those plug-ins.

Data collected by the developer should be closed off, and permission for the developer’s own sites and information to be scraped should be denied in user preferences and in privacy and security settings; a simple illustration of a scraping opt-out follows below. Telling customers this has been done demonstrates that the developer takes this seriously. It also helps to close the loop on copyright infringement or downstream claims where work originated with, and is therefore owned by, the developer but is then reused and a claim made against the customer or developer. In this respect, advising developers to keep code logs and to verify the work of colleagues and the source of their work is more than just good development hygiene; it is a necessary precaution against litigation.
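One practical step is to publish a robots.txt file denying AI-training crawlers. The sketch below, in Python, simply generates such a file; the crawler tokens shown are commonly cited examples rather than a complete or current list, and robots.txt is advisory only, so it should sit alongside the contractual and settings-based measures described above.

  # Sketch: generating a robots.txt that asks AI-training crawlers not to scrape.
  # The user-agent tokens are commonly cited examples; the list is illustrative,
  # changes over time, and robots.txt is a request, not an enforcement mechanism.
  AI_TRAINING_CRAWLERS = [
      "GPTBot",           # OpenAI's published crawler token
      "Google-Extended",  # Google's token for opting out of AI training
      "CCBot",            # Common Crawl's crawler
  ]

  lines = []
  for crawler in AI_TRAINING_CRAWLERS:
      lines.append(f"User-agent: {crawler}")
      lines.append("Disallow: /")
      lines.append("")  # blank line between records

  with open("robots.txt", "w") as robots_file:
      robots_file.write("\n".join(lines))

  print("\n".join(lines))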

Developers should also be aware that data processed for clients must not be made available for scraping by third parties, as allowing this risks breaching obligations of confidentiality to the client and facilitating copyright infringement by third parties. This type of scraping could easily fall within the definition of a data breach in broadly drafted contracts received from customers (or provided by developers), so this type of provision needs to be interrogated and updated to reflect the development work being undertaken and the permissions granted or refused.

In their own business terms and conditions, developers should not give any warranties that they withhold information or prevent it from being scraped, however, as a guarantee to this effect can never be 100% certain. Instead, any information received from clients (images, text, video, code, trade marks, business names etc) should be warranted by the client, and as much responsibility as possible should be placed on the client to verify the source and ownership of the information it provides; the obligation to ‘lock that down’ should sit with the customer. The customer should also contractually confirm that data and information it identifies, procures or supplies to a developer cannot be scraped, and that, if it can be or has been retrieved in this way, all necessary permissions for it to be created, stored and used by third parties have been secured and all laws applicable to it have been complied with. An indemnity in this respect is preferable.

Transparency and post-truth AI

The EU AIA requires developers who use AI to “generate or manipulate image, audio or video content that appreciably resembles existing persons, objects, places, entities or events and would falsely appear to a person to be authentic or truthful (deep fakes)” to clearly label them as such. The Commission also envisages it may “encourage and facilitate the drawing up of codes of practice at Union level to facilitate the effective implementation of the obligations regarding the detection and labelling of artificially generated or manipulated content” to help users know what is “computer generated” (Chapter IV, Art 50). That the ability to be ‘more human than human’ stems from data scraped with or without permission, used ethically or otherwise and developed with or without a defined purpose is ironic. The AIA has good intentions, but for now the legislation is like sticking a paper notice on a volcano saying “Caution, contents may be hot”.

Can a program self-certify and self-regulate? There seems to be ample opportunity for this and no restriction on doing so, provided the tests are met and the evidence of compliance is available. If AI is set to mark its own homework in this way, we have to ensure that it gets it right in terms of both the work and the marking!

Trying to squeeze the data genie back into the bottle is futile; it is too big and too clever for that. The information already available is so expansive that it includes every facet of what makes us human and how we live on this planet. Data scraping has been done at scale for years with little regulation or ethical oversight, and it will continue for as long as the voracious appetite of AI requires constant updating with huge data sets. For now, the law is developing practical solutions, codes and regulations for what is essentially the biggest information heist (or data breach, if you choose to be cynical) ever.

The advice to give clients is to:

  • be cautious about handing over data and record where it is released to; businesses and developers need to know this, and individuals should be told;
  • never warrant that information received and supplied is correct, complete or correctly permissioned, unless it is original source material that has remained under the control of a single entity;
  • ensure their own creative output and personal data are not made available for scraping; and
  • remember that reassuring individuals their data is safe because they have given permission for it to be used is inappropriate, and probably incorrect too.

Announcing that everything you ever put online could be used to develop “an AI you” is alarmist, but this is probably the closest to the truth of what data scraping for AI development has been and will be used for. Given the determined effort to create “better than human” computers and (like proposals to terraform Mars) to “intelliform” the internet, there is every chance that AI will tell us how it needs to be regulated to achieve the AI-interpreted version of ethical, compliant and legal.

Until AI shows us a better way, lawyers need to participate in conversations about better regulation of use of personal data and the legal and ethical implications surrounding every human/machine interface. We need to be able to advise clients on how to close the back door to data-scraping, not simply to “check the box” on information use, and to be very clear about where the upstream flow of data comes from and how data output is used downstream, to avoid data scrapes and scrapes with data.

Joanne Frears is IP & Technology Leader at Lionshead Law, a virtual law firm specialising in employment, immigration, commercial and technology law. She advises innovation clients on all manner of commercial and IP matters and is a regular speaker on future law. Email j.frears@lionsheadlaw.co.uk. Twitter @techlioness.

Image Public Domain via Rawpixel.