The renowned newspaper The New York Times (NYT) has filed a lawsuit against Microsoft and Microsoft-backed OpenAI, accusing them of using its articles without permission. Not only were the articles used in the training data sets for their large language models (LLMs), but they also surface in the output of ChatGPT and Copilot, systems that generate new content for users.
OpenAI has not disclosed the specific data set used to train its GPT-4 model. For GPT-2 and GPT-3, however, it openly described the WebText and WebText2 data sets, which consist of large amounts of high-quality web text sourced from highly popular outbound links on Reddit, alongside a filtered Common Crawl corpus of roughly 410 billion tokens obtained from web scraping. Remarkably, within the Common Crawl data set, only Google Patents and Wikipedia contribute more data than NYT.
The lawsuit states that NYT attempted to negotiate with both Microsoft and OpenAI before filing, but no agreement was reached. The case bears similarities to Getty Images' lawsuit against Stability AI for using its images to train a similar artificial intelligence system.
TLDR: The New York Times has filed a lawsuit against Microsoft and OpenAI, alleging unauthorized use of its articles in training data sets and AI systems. OpenAI’s data sets, including Common Crawl, contain substantial amounts of NYT data. Negotiations prior to the lawsuit were unsuccessful, drawing parallels to a similar copyright infringement case involving Getty Images and Stability AI.