Google Cloud has introduced the Dynamic Workload Scheduler to address a problem facing customers who cannot obtain GPUs or TPUs for training artificial intelligence models because of a chip shortage. The service operates in two modes.
In Flex Start mode, customers specify the number of chips they need and how long they will use them. The system queues the request until that many chips become available, then starts the job. This mode suits short experiments or fine-tuning tasks that take anywhere from a few minutes up to a maximum of 7 days.
Calendar mode, by contrast, is a reservation system for clusters: it requires 7-14 days' advance notice and allows booking up to 8 weeks ahead.
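The queue-then-run model behind Flex Start can be sketched as a small simulation. This is purely illustrative, not the actual Google Cloud API: the class names, capacity numbers, and FIFO dispatch policy are all assumptions made for the example.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class JobRequest:
    name: str
    chips: int           # number of accelerators requested
    duration_hours: int  # requested run time, capped at 7 days


class FlexStartScheduler:
    """Toy model of Flex Start: queue requests until enough chips free up."""

    MAX_DURATION_HOURS = 7 * 24  # Flex Start jobs run at most 7 days

    def __init__(self, total_chips: int):
        self.free_chips = total_chips
        self.queue: deque[JobRequest] = deque()
        self.running: list[JobRequest] = []

    def submit(self, job: JobRequest) -> None:
        if job.duration_hours > self.MAX_DURATION_HOURS:
            raise ValueError("Flex Start jobs are limited to 7 days")
        self.queue.append(job)
        self._dispatch()

    def finish(self, job: JobRequest) -> None:
        # A job completed: return its chips and try to start queued work.
        self.running.remove(job)
        self.free_chips += job.chips
        self._dispatch()

    def _dispatch(self) -> None:
        # Start queued jobs in FIFO order while capacity allows.
        while self.queue and self.queue[0].chips <= self.free_chips:
            job = self.queue.popleft()
            self.free_chips -= job.chips
            self.running.append(job)
```

With 8 chips total, a 6-chip job starts immediately while a subsequent 4-chip job waits in the queue until the first one finishes, mirroring how the real service holds a request until the desired capacity is available.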
This approach is reminiscent of batch processing on 1960s computers, which were very expensive and lacked the parallel processing capabilities we have today. Users submitted their jobs in advance on punched cards or magnetic tape; when their turn came, the system loaded and executed the jobs and printed the results. Today's shortage of computational resources for training AI models has brought this old approach back.
TLDR: Google Cloud has launched the Dynamic Workload Scheduler to address limited access to GPUs and TPUs for AI training. It offers two modes: Flex Start, which queues short experiments of up to 7 days, and Calendar, which lets customers pre-book clusters up to 8 weeks ahead. The system resembles the batch processing approach of early computers, revived to cope with today's scarcity of computational resources for AI training.