OpenAI has introduced SWE-Lancer, a benchmark built from 1,488 real programming tasks posted on the freelancing platform Upwork, with payouts ranging from $50 to $32,000 per task. The tasks are worth $1 million in total, and a model earns the payout of each task it solves. Within that $1 million pool, the IC SWE subtest, which focuses on hands-on programming work, carries a full score of $236,000.

The current leader is o3-high, introduced today, which earns $65,250, while o4-mini-high earns $56,375, roughly twice as much as o1-high. Although even the best models remain far from a full score, that headroom makes SWE-Lancer better suited to tracking future progress than SWE-Bench Verified, where o3 already scores 69.1%. Notably, Claude 3.5 Sonnet earns up to $58,000, surpassing o4-mini-high.

A closer look at the results shows that all models perform relatively well on backend work but score poorly on UX/UI tasks. The suite is open-sourced on GitHub, though it does not yet support multimodal input, so visual materials such as screenshots are absent from the tasks.
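The pay-weighted scoring described above can be sketched in a few lines: a model "earns" the payout of each task it solves, and its score is the running total, optionally expressed as a fraction of the subtest pool. The figures below are the IC SWE numbers quoted in this article; the per-task payouts in the usage example are illustrative, not actual benchmark data.

```python
# Minimal sketch of SWE-Lancer-style pay-weighted scoring.
# Assumption: each solved task contributes its full payout (the article
# describes earning money per solved task; partial credit is not modeled).

IC_SWE_POOL = 236_000  # full-score pool for the IC SWE subtest (USD)

def earned(solved_payouts):
    """Total dollars earned across the tasks a model solved."""
    return sum(solved_payouts)

def earned_fraction(total_earned, pool=IC_SWE_POOL):
    """Score expressed as a fraction of the subtest's total pool."""
    return total_earned / pool

# Illustrative payouts for a handful of solved tasks:
print(earned([50, 500, 4_000]))            # → 4550
# o3-high's reported $65,250 as a share of the $236,000 pool:
print(f"{earned_fraction(65_250):.1%}")    # → 27.6%
```

Framing scores in dollars rather than pass rates means a model that solves a few high-value tasks can outrank one that solves many cheap ones, which is the design choice that distinguishes SWE-Lancer from percentage-based suites like SWE-Bench Verified.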
TLDR: OpenAI unveils the SWE-Lancer test suite with $1 million worth of real freelance programming tasks, showcasing both AI's current capabilities and its areas for improvement.