journalismAI.com | Revealed: The Authors Whose Pirated Books Are Powering Generative AI

Revealed: The Authors Whose Pirated Books Are Powering Generative AI | THE ATLANTIC

August 19, 2023

1–2 minutes

“. . . few people outside of companies such as Meta and OpenAI know the full extent of the texts these programs have been trained on.”
– Alex Reisner

The Atlantic shines a light on the literary datasets beneath large language models (LLMs). Very large quantities of words are essential for training the models to see word associations.

Using other people’s language for AI training is an undecided and murky area of intellectual property law.

Tens of thousands of copyrighted works, and possibly many more, may have been used to train LLMs now in general use. At issue is whether copyrighted works can be used for training without permission.

The exact composition of the training sets is not readily known beyond the few companies developing the LLMs. The author details the steps he took to discover one dataset’s contents.

Writing datasets use names like “Books1,” “Books2,” “Books3,” and “The PIle.”

SEE FULL STORY

Revealed: The Authors Whose Pirated Books Are Powering Generative AI | THE ATLANTIC | August 19, 2023 | by Alex Reisner

AI advances
AI in the newsroom
Ethics & standards
Explainers
Trust

SEE BY TYPE

Academic papers Blog Briefs 2014-2020 Op-Eds Quotes Reports Video

LATEST

Ethics & standards

Meet the One Woman Anthropic Trusts to Teach AI Morals | WSJ.MAGAZINE
AI advances, Trust

Albania Created an ‘A.I. Minister’ to Curb Corruption. Then Its Developers Were Accused of Graft | THE NEW YORK TIMES
Trust

Anthropic Releases A New ‘Constitution’ For Claude | FORTUNE
AI in the newsroom

Publishers prepare to be “squeezed” by AI and creators in 2026 | NEIMAN LAB

journalismAI.com is a public knowledge base supporting AI literacy for journalists, maintained since 2017. The indexed collection is updated regularly and contains 700+ entries, each curated and written by humans.

Click on a subject above or use the search box in the masthead.

About Us
Privacy Policy
Contact Us
The Next Disruption

Revealed: The Authors Whose Pirated Books Are Powering Generative AI | THE ATLANTIC

SEE BY SUBJECT

SEE BY TYPE

LATEST

Meet the One Woman Anthropic Trusts to Teach AI Morals | WSJ.MAGAZINE

Albania Created an ‘A.I. Minister’ to Curb Corruption. Then Its Developers Were Accused of Graft | THE NEW YORK TIMES

Anthropic Releases A New ‘Constitution’ For Claude | FORTUNE

Publishers prepare to be “squeezed” by AI and creators in 2026 | NEIMAN LAB

Discover more from journalismAI.com