OpenAI Reportedly Used Over One Million Hours of YouTube Videos to Train GPT-4, According to the New York Times

The New York Times, citing three unnamed sources, reported that OpenAI used more than a million hours of YouTube videos to train GPT-4, despite internal concerns that the practice could violate YouTube’s terms of use.

Developers of machine learning models, particularly large language models (LLMs), need vast amounts of text data to train their systems effectively. That text must be high quality and trustworthy for the models to produce accurate responses. OpenAI has reportedly paid for such content, at a rumored cost of roughly 1 to 5 million dollars per year.

Historically, AI training corpora have largely been assembled from web scrapes, with some researchers relying on curated sources such as Wikipedia. Expanding these datasets has become increasingly difficult, however, as continued scraping tends to yield lower-quality material.

Unlike OpenAI, companies with their own content platforms hold a competitive advantage. Google has previously stated that it trains AI on YouTube content with permission from content owners, and Meta could potentially draw on Instagram and Facebook, provided users consent. OpenAI, by contrast, operates no platform with substantial user-generated content; its main product, ChatGPT, produces AI-generated output rather than fresh human-written text.

Source – New York Times

TLDR: OpenAI’s training practices raise concerns about data quality and user consent, while other companies leverage proprietary platforms for more effective AI training.
