Evaluating large language models (LLMs) and AI in the medical field is challenging due to the complexity and variability of medical cases. Traditional benchmarks, such as the United States Medical Licensing Examination (USMLE), provide a standardized measure of medical knowledge and diagnostic reasoning. However, the true test of an AI’s capabilities lies in its performance on real-world scenarios, often assessed through side-by-side comparisons of AI-generated differential diagnoses against those of experienced doctors. Remarkably, GPT-4 has outperformed human doctors in many of these comparisons, demonstrating strong accuracy and efficiency in diagnosing a wide range of conditions. This performance underscores the potential of AI to enhance medical practice, though it also raises important questions about how such technologies should be integrated into healthcare systems and about the need for robust evaluation frameworks.

GPT-4 vs. Doctors

This chart compares the diagnostic performance of GPT-4, unaided doctors, and doctors using a search tool on the NEJM CPC benchmark of complex diagnostic challenges. The chart measures top-k accuracy, which indicates how often the correct diagnosis appears among the top k suggestions; specifically, it shows the top-1, top-3, and top-10 accuracies for each group. GPT-4, the latest state-of-the-art language model, outperforms doctors when they are unaided. However, when doctors utilize a search tool, their diagnostic accuracy improves markedly, often matching or exceeding the performance of GPT-4. This demonstrates the substantial impact that advanced search capabilities and tools can have in enhancing the accuracy of complex medical diagnoses.
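
To make the metric concrete, here is a minimal sketch of how top-k accuracy could be computed. The function name, the exact-string matching, and the toy cases are illustrative assumptions; in the actual NEJM CPC evaluation, whether a suggested diagnosis matched the reference answer was judged by clinicians rather than by string comparison.

```python
def top_k_accuracy(ranked_predictions, gold_diagnoses, k):
    """Fraction of cases where the reference diagnosis appears in the top-k suggestions.

    ranked_predictions: one ranked differential-diagnosis list per case
    gold_diagnoses: one reference diagnosis per case
    (Hypothetical helper for illustration; real grading relied on expert raters.)
    """
    hits = sum(
        gold in ranked[:k]
        for ranked, gold in zip(ranked_predictions, gold_diagnoses)
    )
    return hits / len(gold_diagnoses)


# Toy example: two cases, with the correct diagnosis ranked 2nd and 4th.
preds = [
    ["pneumonia", "sarcoidosis", "tuberculosis"],
    ["lupus", "lyme disease", "gout", "adult-onset still disease"],
]
gold = ["sarcoidosis", "adult-onset still disease"]

print(top_k_accuracy(preds, gold, k=1))   # 0.0  (correct answer never ranked first)
print(top_k_accuracy(preds, gold, k=3))   # 0.5  (ranked 2nd in the first case only)
print(top_k_accuracy(preds, gold, k=10))  # 1.0  (both within the top 10)
```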

GPT-4 on Medical Benchmarks

Table 1 compares the performance of different models on the USMLE Self-Assessment. The models compared are GPT-4 and GPT-3.5 (the older model), each evaluated in “5-shot” and “zero-shot” settings. In the “5-shot” setting the model is given five worked examples before being tested, while in the “zero-shot” setting it is tested without any prior examples.
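
As a rough illustration of the difference between these settings, the sketch below assembles a zero-shot or 5-shot prompt from worked examples. The record fields, prompt wording, and example questions are hypothetical and not taken from the benchmark itself.

```python
# Hypothetical worked examples; a true 5-shot prompt would include five of these.
SOLVED_EXAMPLES = [
    {
        "question": "A 45-year-old man presents with chest pain radiating to the left arm...",
        "options": "A) GERD  B) Myocardial infarction  C) Costochondritis  D) Pneumothorax",
        "answer": "B",
    },
    # ... four more worked examples
]

def build_prompt(test_question, test_options, shots=0):
    """Build a prompt with `shots` in-context examples (0 = zero-shot, 5 = 5-shot)."""
    parts = [
        f"Question: {ex['question']}\n{ex['options']}\nAnswer: {ex['answer']}"
        for ex in SOLVED_EXAMPLES[:shots]
    ]
    # The test question goes last, with the answer left for the model to complete.
    parts.append(f"Question: {test_question}\n{test_options}\nAnswer:")
    return "\n\n".join(parts)

zero_shot = build_prompt("A 60-year-old woman presents with...", "A) ...  B) ...  C) ...  D) ...")
five_shot = build_prompt("A 60-year-old woman presents with...", "A) ...  B) ...  C) ...  D) ...", shots=5)
```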

Table 2 compares the performance of different models on the USMLE Sample Exam, a dataset also studied in [KCM+23]. The models compared are GPT-4 and GPT-3.5, both in “5-shot” and “zero-shot” settings, alongside an independently reported ChatGPT score.