Former engineer and AI developer Ed Newton-Rex left Stability AI at the end of last year. Troubled by the industry's reliance on copyrighted content for training, he founded Fairly Trained to certify that AI models are trained only on properly licensed datasets or on public domain data available to everyone.
Newton-Rex noted that when companies are asked what data they used to train their AI, the standard answer is "publicly available data." The phrase is misleading: it suggests the companies had permission to collect the data rather than having actively scraped it, and it is very different from data that is actually in the public domain.
Legal counsel Timothy K. Giordano, who is involved in lawsuits against several AI development companies, pointed out that much "publicly available" data is in fact protected by copyright, and that scraping it from websites can be unlawful.
AI development companies often cite the precedent of the Google Books case, in which courts ruled that certain uses of copyrighted, publicly available material constitute fair use. They also argue that AI is trained on, or "learns" from, varied data sources in a way analogous to human learning, rather than simply copying the content.
The issue of using publicly available data gained prominence after an interview with Mira Murati, OpenAI's CTO, in which she was asked whether the data used to train Sora, the company's video-generation model, had been sourced from YouTube. YouTube CEO Neal Mohan later said that if OpenAI had trained on YouTube videos, it would be a breach of YouTube's terms of service.
Axios compiled the major AI developers' explanations of their training data sources: OpenAI says it uses licensed data alongside publicly accessible internet data; Google says it trains on data found online and lets websites restrict access to their content for AI training; Meta says Llama 2 was trained on publicly accessible online datasets; and Microsoft says it draws on various publicly accessible online sources in compliance with copyright and other legal requirements.
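As an illustration of the opt-out mechanism Google describes, website operators can disallow known AI-training crawlers in their robots.txt file. The sketch below uses the publicly documented user-agent tokens Google-Extended (Google's AI-training crawler token) and GPTBot (OpenAI's web crawler); which crawlers a given site should block is a policy choice, not something the article specifies.

```text
# robots.txt — block AI-training crawlers while leaving normal search indexing alone

# Google's token for AI-training use of crawled content
User-agent: Google-Extended
Disallow: /

# OpenAI's web crawler
User-agent: GPTBot
Disallow: /
```

Note that robots.txt is a voluntary convention: compliant crawlers honor it, but it does not technically prevent scraping, which is part of why the legal questions above remain contested.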
In conclusion, AI developers must navigate the complexities of data usage and adhere to legal standards to ensure fair and ethical AI development practices.
TLDR: Former AI developer Ed Newton-Rex left Stability AI over its use of copyrighted content for training and founded Fairly Trained to certify AI models trained on licensed or public domain data. Legal disputes are mounting because much "publicly available" data is protected by copyright, prompting calls for transparency and adherence to legal standards in AI development.