A research team from Stanford University's Institute for Human-Centered Artificial Intelligence (HAI) has released a report on how well large language models (LLMs) perform in medical use. Despite reports that LLMs can diagnose diseases remarkably well, their error rate is high enough that their output still needs to be monitored.
The team evaluated four models: GPT-4, Claude 2.1, Mistral Medium, and Gemini Pro. In addition, GPT-4 was tested with Retrieval Augmented Generation (RAG) to see how accurately these LLMs can generate responses backed by proper references.
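For context, RAG means retrieving real documents first and having the model answer from them, so any cited URL comes from material that actually exists rather than from the model's imagination. Below is a minimal sketch of the idea; the toy corpus, the word-overlap retriever, and the prompt-building stub are illustrative assumptions, not the setup used in the report.

```python
# Minimal retrieval-augmented generation (RAG) sketch.
# The corpus, the term-overlap scorer, and generate_answer() are
# illustrative placeholders, not the pipeline used in the report.

# Toy corpus of (url, text) pairs standing in for a real medical index.
CORPUS = [
    ("https://example.org/hypertension", "Lifestyle changes and medication lower blood pressure."),
    ("https://example.org/diabetes", "Type 2 diabetes is managed with diet, exercise, and metformin."),
]

def retrieve(question: str, k: int = 2) -> list[tuple[str, str]]:
    """Rank documents by naive word overlap with the question."""
    q_words = set(question.lower().split())
    scored = [(len(q_words & set(text.lower().split())), url, text)
              for url, text in CORPUS]
    scored.sort(reverse=True)
    return [(url, text) for _, url, text in scored[:k]]

def generate_answer(question: str, sources: list[tuple[str, str]]) -> str:
    """Placeholder for the LLM call: build a prompt that constrains the
    model to cite only the retrieved URLs, then return it."""
    prompt = "Answer using ONLY these sources and cite their URLs:\n"
    prompt += "\n".join(f"- {url}: {text}" for url, text in sources)
    prompt += f"\n\nQuestion: {question}"
    return prompt  # in a real pipeline this prompt would be sent to the model

print(generate_answer("How is type 2 diabetes treated?",
                      retrieve("How is type 2 diabetes treated?")))
```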
The accuracy measurement consists of three levels: 1) source validity, verifying that cited URLs actually exist and the model does not fabricate them; 2) statement-level support, confirming that each sentence in the answer is backed by a real source; and 3) response-level support, an overall judgment of whether the answer as a whole rests on genuine sources. The evaluation used 1,200 questions.
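As a rough illustration of what such a three-level check could look like in code, the sketch below verifies URL validity with a simple HEAD request and uses a crude word-overlap heuristic as a stand-in for the statement-level support judgment; none of the function names or thresholds come from the report.

```python
# Sketch of a three-level citation check: (1) do cited URLs exist,
# (2) is each statement backed by a cited source, (3) is the whole
# answer supported. statement_is_supported() is a naive placeholder
# for the judgment step, not the study's actual method.
import urllib.request
import urllib.error

def url_exists(url: str, timeout: float = 5.0) -> bool:
    """Level 1: the cited URL resolves, i.e., it was not fabricated."""
    try:
        req = urllib.request.Request(url, method="HEAD")
        urllib.request.urlopen(req, timeout=timeout)
        return True
    except (urllib.error.URLError, ValueError):
        return False

def statement_is_supported(statement: str, source_text: str) -> bool:
    """Level 2 (placeholder): crude word-overlap heuristic standing in
    for a human or model judge of whether the source backs the claim."""
    words = set(statement.lower().split())
    overlap = words & set(source_text.lower().split())
    return len(overlap) >= max(1, len(words) // 3)

def answer_is_supported(statements: list[str], sources: dict[str, str]) -> bool:
    """Level 3: the answer counts as supported only if every statement
    has at least one real source that backs it."""
    return all(
        any(url_exists(url) and statement_is_supported(st, text)
            for url, text in sources.items())
        for st in statements
    )
```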
Results show that LLMs fabricate sources quite frequently; even GPT-4 produced inaccurate sources in up to 30% of cases. This problem is largely mitigated by using RAG, where most cited sources are real. However, when the answers themselves were examined, many statements still lacked a supporting source, and even GPT-4 with RAG produced a high percentage of such unsupported answers.
The research team noted the excitement around LLMs scoring higher than medical students on exams. Practicing medicine, however, demands a far more comprehensive evaluation than the multiple-choice tests used to assess LLMs.
Source: Institute for Human-Centered AI
TLDR: The report highlights the limitations of current AI models in accurately referencing sources and the need for further evaluation of AI capabilities compared to human professionals.