AI companies face mounting lawsuits, and they could avoid future legal trouble if they were able to remove copyright-infringing data from their trained models.
“Follow the data” is good advice for understanding the provenance of large language models, particularly when it comes to vexing issues such as copyright infringement, potential privacy breaches, and bias. But it’s easier said than done, as training data often comes from massive collections that combine multiple types of information from many sources.
Researchers in Amazon’s AWS division are working on ways to purge pieces of a training set from a model without discarding everything and retraining from scratch.
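The article does not detail the AWS approach. One published technique in this area is SISA training (Bourtoule et al., "Machine Unlearning"), which splits the training set into shards, trains a separate model per shard, and ensembles them; deleting a record then only requires retraining the one shard model that saw it. A minimal toy sketch of that idea, assuming a trivial averaging "model" for illustration (the function names and the averaging model are illustrative, not from the article or from Amazon's work):

```python
# Toy SISA-style sharded training: unlearning a record only retrains
# the single shard model that contained it, not the whole ensemble.
import statistics

def train_shard(shard):
    # Stand-in "model": the mean of the shard's values.
    return statistics.mean(shard)

def train(data, n_shards=4):
    # Partition the data into disjoint shards and train one model each.
    shards = [data[i::n_shards] for i in range(n_shards)]
    models = [train_shard(s) for s in shards]
    return shards, models

def predict(models):
    # Ensemble prediction: average the shard models' outputs.
    return statistics.mean(models)

def unlearn(shards, models, value):
    # Remove one record, then retrain only its shard's model,
    # so the record's influence is exactly erased at low cost.
    for i, shard in enumerate(shards):
        if value in shard:
            shard.remove(value)
            models[i] = train_shard(shard)
            break
    return shards, models
```

The key property is that the deleted record influenced exactly one shard model, so retraining that one model provably removes its contribution; the cost scales with the shard size rather than the full training set.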
‘Disgorgement’: Amazon researchers suggest ways to get rid of bad AI data | SEMAFOR | May 1, 2024 | by Katyanna Quach