Apple’s research team recently published a study on Ferret-UI, a multimodal large language model (MLLM) designed to address the limitations general-purpose MLLMs face when dealing with highly detailed image inputs such as screen captures.
Screen captures are challenging because their elongated aspect ratios differ from the standard training images most models are accustomed to, and the elements that matter most, such as icons and buttons, are often small and low-resolution. As a result, a model may struggle to distinguish these elements even when they are the crucial points of focus in the input.
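For context, the Ferret-UI paper addresses the aspect-ratio problem with an "any resolution" scheme that divides each screen into sub-images based on its orientation, so small UI elements retain enough pixels after resizing. The sketch below illustrates that general idea only; the function name, the use of Pillow, and the half-and-half split are our own illustration, not Apple’s actual pipeline.

```python
from PIL import Image

def split_screenshot(path: str) -> list[Image.Image]:
    """Return the full screenshot plus aspect-ratio-based crops.

    Illustrative sketch of the "any resolution" idea: portrait
    screens are cut into top/bottom halves, landscape screens into
    left/right halves, and each crop is encoded alongside the full
    image so small icons and buttons keep detail.
    """
    img = Image.open(path)
    w, h = img.size
    if h >= w:
        # Portrait screen: split horizontally into top and bottom halves
        crops = [img.crop((0, 0, w, h // 2)), img.crop((0, h // 2, w, h))]
    else:
        # Landscape screen: split vertically into left and right halves
        crops = [img.crop((0, 0, w // 2, h)), img.crop((w // 2, 0, w, h))]
    return [img] + crops

# Each returned view would then be resized and encoded separately,
# giving the model a global view plus higher-resolution local views.
views = split_screenshot("home_screen.png")
```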
Ferret-UI stands out because it is trained on screen images paired with a variety of commands and tasks, allowing it to identify icons, extract essential text, and interpret widget data better than other models. In testing, it outperformed GPT-4V and other UI-focused MLLMs on screen-understanding tasks.
While the study highlights the model’s success, it does not spell out practical applications for Ferret-UI, and it remains unclear whether Apple intends to roll the technology out to all users, not least because of privacy considerations. Nonetheless, it could prove especially beneficial for users with visual impairments seeking improved accessibility.
Source: 9to5Mac
TLDR: Apple’s research introduces Ferret-UI, a multimodal model tailored for intricate image inputs like screen captures that outperforms existing models, including GPT-4V, at interpreting screen content. Its practical applications and widespread adoption, however, remain uncertain pending further development.