GPT-4 assistance for improvement of physician performance on patient care tasks: a randomized controlled trial

Kanjee, Z., Crowe, B. & Rodman, A. Accuracy of a generative artificial intelligence model in a complex diagnostic challenge. JAMA 330, 78–80 (2023).
Cabral, S. et al. Clinical reasoning of a generative artificial intelligence model compared with physicians. JAMA Intern. Med. 184, 581–583 (2024).
Tu, T. et al. Towards conversational diagnostic AI. Preprint at (2024).
McDuff, D. et al. Towards accurate differential diagnosis with large language models. Preprint at (2023).
Goh, E. et al. Large language model influence on diagnostic reasoning: a randomized clinical trial. JAMA Netw. Open 7, e2440969 (2024).
Zaboli, A., Brigo, F., Sibilio, S., Mian, M. & Turcato, G. Human intelligence versus Chat-GPT: who performs better in correctly classifying patients in triage? Am. J. Emerg. Med. 79, 44–47 (2024).
Truhn, D. et al. A pilot study on the efficacy of GPT-4 in providing orthopedic treatment recommendations from MRI reports. Sci. Rep. 13, 20159 (2023).
Cook, D. A., Sherbino, J. & Durning, S. J. Management reasoning beyond the diagnosis. JAMA 319, 2267–2268 (2018).
Ledley, R. S. & Lusted, L. B. Reasoning foundations of medical diagnosis: symbolic logic, probability, and value theory aid our understanding of how physicians reason. Science 130, 9–21 (1959).
Bordage, G. Prototypes and semantic qualifiers: from past to present. Med. Educ. 41, 1117–1121 (2007).
Bowen, J. L. Educational strategies to promote clinical diagnostic reasoning. N. Engl. J. Med. 355, 2217–2225 (2006).
Cook, D. A., Stephenson, C. R., Gruppen, L. D. & Durning, S. J. Management reasoning: empirical determination of key features and a conceptual model. Acad. Med. 98, 80–87 (2023).
Mercuri, M. et al. When guidelines don’t guide: the effect of patient context on management decisions based on clinical practice guidelines. Acad. Med. 90, 191–196 (2015).
Schmidt, H. G., Norman, G. R., Mamede, S. & Magzoub, M. The influence of context on diagnostic reasoning: a narrative synthesis of experimental findings. J. Eval. Clin. Pract. 30, 1091–1101 (2024).
Parsons, A. S., Wijesekera, T. P. & Rencic, J. J. The management script: a practical tool for teaching management reasoning. Acad. Med. 95, 1179–1185 (2020).
Reverberi, C. et al. Experimental evidence of effective human–AI collaboration in medical decision-making. Sci. Rep. 12, 14952 (2022).
Kempt, H. & Nagel, S. K. Responsibility, second opinions and peer-disagreement: ethical and epistemological challenges of using AI in clinical diagnostic contexts. J. Med. Ethics 48, 222–229 (2022).
Restrepo, D., Rodman, A. & Abdulnour, R.-E. Conversations on reasoning: large language models in diagnosis. J. Hosp. Med. 19, 731–735 (2024).
Friedman, C. P. et al. Enhancement of clinicians’ diagnostic reasoning by computer-based consultation: a multisite study of 2 systems. JAMA 282, 1851–1856 (1999); erratum 285, 2979 (2001).
Miller, R. A., Pople, H. E. Jr & Myers, J. D. Internist-1, an experimental computer-based diagnostic consultant for general internal medicine. N. Engl. J. Med. 307, 468–476 (1982).
Chen, Y. et al. SoulChat: improving LLMs’ empathy, listening, and comfort abilities through fine-tuning with multi-turn empathy conversations. Preprint at (2023).
Ayers, J. W. et al. Comparing physician and artificial intelligence chatbot responses to patient questions posted to a public social media forum. JAMA Intern. Med. 183, 589–596 (2023).
Tai-Seale, M. et al. AI-generated draft replies integrated into health records and physicians’ electronic communication. JAMA Netw. Open 7, e246565 (2024).
Chen, S. et al. The effect of using a large language model to respond to patient messages. Lancet Digit. Health 6, e379–e381 (2024).
Pfeffer, M. A., Shah, N. H., Sharp, C. & Lindmark, C. Nigam Shah and partners roll out beta version of Stanford medicine SHC and SoM Secure GPT. Stanford Medicine (2024).
Nori, H. et al. Can generalist foundation models outcompete special-purpose tuning? Case study in medicine. Preprint at (2023).
Core IM. American College of Physicians www.acponline.org/cme-moc/internal-medicine-cme/internal-medicine-podcasts/core-im (2024).
Pell, G., Fuller, R., Homer, M. & Roberts, T. How to measure the quality of the OSCE: a review of metrics—AMEE guide no. 49. Med. Teach. 32, 802–811 (2010).
Khan, K. Z., Ramachandran, S., Gaunt, K. & Pushkar, P. The Objective Structured Clinical Examination (OSCE): AMEE Guide No. 81. Part I: an historical and theoretical perspective. Med. Teach. 35, e1437–e1446 (2013).
Cook, D. A., Durning, S. J., Stephenson, C. R., Gruppen, L. D. & Lineberry, M. Assessment of management reasoning: design considerations drawn from analysis of simulated outpatient encounters. Med. Teach. 1–15 (2024).
Singaraju, R. C., Durning, S. J., Battista, A. & Konopasky, A. Exploring procedure-based management reasoning: a case of tension pneumothorax. Diagnosis 9, 437–445 (2022).
Jones, J. & Hunter, D. Consensus methods for medical and health services research. BMJ 311, 376–380 (1995).
Meskó, B. Prompt engineering as an important emerging skill for medical professionals: tutorial. J. Med. Internet Res. 25, e50638 (2023).
Van Veen, D. et al. Adapted large language models can outperform medical experts in clinical text summarization. Nat. Med. 30, 1134–1142 (2024).
Gallo, R. J., Savage, T. & Chen, J. H. Affiliation bias in peer review of abstracts. JAMA 331, 1234–1235 (2024).
Gallo, R. J. et al. Establishing best practices in large language model research: an application to repeat prompting. J. Am. Med. Inform. Assoc. 32, 386–390 (2025).
Goh, E. et al. GPT-4 assistance for improvement of physician performance on patient care tasks: a randomized controlled trial. Figshare (2025).