“. . . few people outside of companies such as Meta and OpenAI know the full extent of the texts these programs have been trained on.”

– Alex Reisner

The Atlantic shines a light on the literary datasets beneath large language models (LLMs). Training these models to learn associations between words requires very large quantities of text.

Using other people’s writing for AI training is an unsettled area of intellectual property law.

Tens of thousands of copyrighted works, and possibly many more, may have been used to train LLMs now in general use. At issue is whether copyrighted works can be used for training without permission.

The exact composition of the training sets is not widely known outside the few companies developing the LLMs. The author details the steps he took to discover one dataset’s contents.

The text datasets go by names like “Books1,” “Books2,” “Books3,” and “The Pile.”


Revealed: The Authors Whose Pirated Books Are Powering Generative AI | THE ATLANTIC | August 19, 2023 | by Alex Reisner