The New York Times, citing three unnamed sources, reported that OpenAI transcribed more than a million hours of YouTube videos to train GPT-4, despite internal concerns that doing so might violate YouTube's terms of use.
Developers in the machine learning community, particularly those building large language models (LLMs), need vast amounts of text to train their models effectively. That text must be high-quality and trustworthy if the AI is to produce accurate responses, and OpenAI is rumored to pay roughly $1–5 million per year to license such content.
Historically, AI training data has largely been scraped from the web, with some researchers favoring curated sources such as Wikipedia. Expanding these datasets has become increasingly difficult, however, as continued web scraping tends to yield lower-quality text.
Other companies enjoy a competitive advantage here thanks to their proprietary platforms. Google has previously stated that it uses YouTube content to train AI with the permission of content owners, and Meta operates Instagram and Facebook, whose user posts could potentially be used for training with consent. OpenAI, by contrast, owns no platform with substantial user-generated content; its closest equivalent, ChatGPT, consists largely of AI-generated text.
Source – New York Times
TLDR: OpenAI’s training practices raise concerns about data quality and user consent, while competitors such as Google and Meta can draw on their own proprietary platforms for training data.