This paper, titled “Performance Evaluation of Large Language Models in Medicine”, was presented at the International AI in Health Congress in 2024 by Ms. Sama Khoraminejad, Ms. Zahra Vatan-Khah, and Dr. Mehrnoosh Shams-Fard.

Abstract:
This study aimed to evaluate the performance of 11 large language models (LLMs) in responding to Persian medical questions. The questions were categorized into two groups: general and specialized. General questions included common inquiries and widespread misconceptions, while specialized questions were derived from Iran’s pre-internship and medical residency exams. These questions covered diverse topics, including obstetrics and gynecology, dermatology, general surgery, general diagnostics and treatments, infectious diseases, psychiatry, orthopedics, gastroenterology, neurology, pediatrics, nutrition, otolaryngology, internal medicine, pharmacology, ophthalmology, and radiology.

The evaluated models were GPT-4o, Claude-3.5-Sonnet, GPT-3.5, Gemini, Llama-3-8b, Llama-3-70b, PersianMind, Gemma-2b-it, Dorna, Aya-23-35b, and Command-R. The selection focused on the Persian language and on covering a range of model sizes.
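As an illustration of how a comparison like this could be tallied, the sketch below computes accuracy per model, separately for the general and specialized question groups. The abstract does not describe the scoring procedure, so the correct/incorrect grading format, the function name `accuracy_by_model_and_group`, and the placeholder records (with generic model labels, not results from the study) are all assumptions.

```python
from collections import defaultdict


def accuracy_by_model_and_group(records):
    """Return {(model, group): accuracy} from graded answers.

    `records` is an iterable of (model, group, is_correct) tuples.
    This layout is hypothetical; the paper's actual format is not stated.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for model, group, is_correct in records:
        key = (model, group)
        totals[key] += 1
        correct[key] += int(is_correct)
    # Every key in totals has at least one record, so no division by zero.
    return {key: correct[key] / totals[key] for key in totals}


# Placeholder graded answers for illustration only (not data from the study).
records = [
    ("model_a", "general", True),
    ("model_a", "general", False),
    ("model_a", "specialized", True),
    ("model_b", "general", True),
    ("model_b", "specialized", False),
]

scores = accuracy_by_model_and_group(records)
for (model, group), acc in sorted(scores.items()):
    print(f"{model:8s} {group:12s} {acc:.2f}")
```

Keeping the two question groups separate in the key makes it possible to see whether a model that handles common inquiries well also holds up on exam-derived questions.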

GPT-4o and Claude-3.5-Sonnet performed best, while Gemma-2b-it performed worst. The models had difficulty identifying medications, recognizing Persian trade names, and providing precise information in certain specialized fields. These findings highlight the need to develop specialized language models and to optimize them to deliver accurate and reliable information in the medical domain.