For many creative professionals, publicly posting to the Internet is a necessary part of
the job. Steven Zapata, a concept artist and illustrator speaking on behalf of the Concept
Art Association, said that, “to advertise our work, most of us put our art online, on social
media and our personal websites. This leaves it exposed to unethical scraping practices.”
These “unethical scraping practices” have been questioned within academia,31 and AI
researchers have clearly stated that using training data that has been obtained from
public sources does not inherently mean that “authorial consent” has been obtained.32
In addition to the scraping of work belonging to creative professionals, Bradley Kuhn, a
policy fellow at the Software Freedom Conservancy, pointed out that depending on the
platforms they use, creative professionals “may have already agreed for their own
creative works to become part of the company's machine learning data sets” because of
what is said in those platforms’ terms of service agreements. Several tech companies
made the news over the summer of 2023 after they updated their terms of service to include
references to building AI with user data,33 eliciting backlash from artists in at least one
instance.34
In some cases, participants said they weren’t even the ones to post their works online in
the first place. Tim Friedlander, president and founder of the National Association of
Voice Actors, pointed out that, “it's incredibly easy to use AI to capture the voice of an
26
See Jordan Hoffmann et al., Training Compute-Optimal Large Language Models, arXiv (Mar. 29,
2022), https://arxiv.org/pdf/2203.15556.pdf.
27
See Touvron et al., supra note 5.
28
See Ilia Shumailov et al., The Curse of Recursion: Training on Generated Data Makes Models Forget,
arXiv (May 31, 2023), https://arxiv.org/abs/2305.17493.
29
See, e.g., Wayne Xin Zhao et al., A Survey of Large Language Models, arXiv (Nov. 24, 2023),
https://arxiv.org/pdf/2303.18223.pdf.
30
See Kevin Schaul et al., Inside the secret list of websites that make AI like ChatGPT sound smart, The
Washington Post (Apr. 19, 2023),
https://www.washingtonpost.com/technology/interactive/2023/ai-chatbot-learning/.
31
See, e.g., Signe Ravn et al., What Is “Publicly Available Data”? Exploring Blurred Public–Private
Boundaries and Ethical Practices Through a Case Study on Instagram, Journal of Empirical Research on
Human Research Ethics, Volume 15 Issue 1-2, at 40-45 (May 19, 2019),
https://journals.sagepub.com/doi/full/10.1177/1556264619850736; see also Antony K. Cooper et al.,
On the Ethics of Using Publicly-Available Data, Responsible Design, Implementation and Use of
Information and Communication Technology, at 159-171 (Mar. 10, 2020),
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7134274/.
32
See Leo Gao et al., The Pile: An 800GB Dataset of Diverse Text for Language Modeling, arXiv, at
Section 6.5 (Dec. 31, 2020), https://arxiv.org/abs/2101.00027.
33
See Matt G. Southern, Google Updates Privacy Policy To Collect Public Data For AI Training, Search
Engine Journal (Jul. 3, 2023),
https://searchenginejournal.com/google-updates-privacy-policy-to-collect-public-data-for-ai-training/490715/;
see also Brian Merchant, Column: These apps and websites use
your data to train AI. You’re probably using one right now., Los Angeles Times (Aug. 16, 2023),
https://www.latimes.com/business/technology/story/2023-08-16/column-its-not-just-zoom-how-websites-and-apps-harvest-your-data-to-build-ai.
34
See Michael Kan, Artists Drop Twitter Over Elon Musk's Plan to Train His AI Project on Tweets,
PCMag (Aug. 1, 2023),
https://www.pcmag.com/news/artists-drop-twitter-over-elon-musks-plan-to-train-his-ai-project-on-tweets.