AI Performs Poorly in Clinical Decision-Making: Accuracy as Low as 13%, Far Inferior to Human Doctors

Researchers tested large language models in the role of emergency department physicians to explore their performance and potential in medical scenarios. The evaluation examined the models' understanding of emergency situations, their diagnostic capabilities, and the accuracy of their treatment recommendations, revealing both the advantages and the limitations of artificial intelligence in clinical decision support.
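
To make this kind of evaluation setup concrete, below is a minimal, hypothetical sketch of how per-disease diagnostic accuracy could be measured: each case's history and laboratory results are formatted into a prompt, the model's free-text answer is compared against the reference diagnosis, and accuracy is aggregated per disease. The case schema, the `query_llm` stub, and the string-matching scoring rule are illustrative assumptions, not the study's actual protocol.

```python
# Hypothetical evaluation sketch: per-disease diagnostic accuracy of an LLM
# acting as an emergency department physician. All names and the scoring rule
# are assumptions for illustration, not the study's methodology.
from collections import defaultdict


def query_llm(prompt: str) -> str:
    """Stand-in for whatever model interface is used; returns a free-text diagnosis."""
    raise NotImplementedError("plug in a real model call here")


def evaluate(cases):
    """cases: iterable of dicts with 'history', 'labs', and 'diagnosis' keys (assumed schema)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for case in cases:
        prompt = (
            "You are an emergency department physician.\n"
            f"Patient history: {case['history']}\n"
            f"Laboratory results: {case['labs']}\n"
            "What is the most likely diagnosis?"
        )
        prediction = query_llm(prompt)
        disease = case["diagnosis"]
        total[disease] += 1
        # Crude substring match; a real study would use a more careful scoring rule.
        if disease.lower() in prediction.lower():
            correct[disease] += 1
    # Accuracy per disease, e.g. {"cholecystitis": 0.13, ...}
    return {d: correct[d] / total[d] for d in total}
```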

The research found that even current state-of-the-art large language models (LLMs) still perform significantly worse than human doctors in clinical diagnosis:

  • Doctors' diagnostic accuracy is 89%, while LLMs reach only 73%. For some diseases (such as cholecystitis), LLM accuracy drops as low as 13%.

  • LLMs perform poorly at following diagnostic guidelines, ordering necessary examinations, and interpreting laboratory results, often missing important information or rushing to a diagnosis.

  • LLMs also struggle to follow basic medical guidelines, making an error once every 2-4 cases and fabricating a non-existent guideline once every 2-5 cases.

  • Providing more case information actually reduces LLMs' diagnostic accuracy, suggesting that they cannot effectively process complex information.

  • Specialized medical LLMs do not significantly outperform general-purpose LLMs.

The researchers conclude that LLMs still require extensive clinical supervision before they can be applied safely. Future work should validate the effectiveness of LLMs in real clinical environments and strengthen collaboration between AI experts and clinicians to optimize the use of LLMs in medicine.

Nevertheless, AI still holds enormous potential in medicine. For example, Google's Med-PaLM 2 model has already reached expert level on some medical exams. In the future, AI may play an important role in assisting diagnosis and medical research, but it is still far too early for it to replace human doctors.