“In many ways, the fate of synthetic data sits at the center of the biggest questions facing generative AI.”

– Reed Albergotti

Semafor reports on work by Microsoft researchers to use data-from-data as a new way to train generative AI models without touching copyrighted materials. Neural networks like the ones behind ChatGPT depend on vast datasets to generate text in response to queries.

Some observers argue that synthetic data (data generated by AI models) could lead to a kind of intellectual inbreeding, eventually degrading the generated text to rubbish.

Others see data-from-data as a highly efficient way to produce tailor-made text for a given use. One example is a model that learns by summarizing all the data in a narrow domain and then draws only on that synthesis to generate new responses. Proponents say the resulting output could be more relevant than that of a model trained on countless peripheral words and concepts.

Semafor also points to work on synthetic data by IBM and Google’s DeepMind.


New synthetic data techniques could change the way AI models are trained | SEMAFOR | November 3, 2023 | by Reed Albergotti