AI LLM Outperforms Physicians in Diagnosis
A paper from Google's DeepMind team demonstrated an LLM significantly out-diagnosing physicians on cases from the NEJM. Or did it?
The paper "Towards Accurate Differential Diagnosis with Large Language Models" (McDuff et al.) focuses on the development and evaluation of a Large Language Model (LLM) optimized for clinical diagnostic reasoning. The primary goal was to assess whether this LLM could assist clinicians in generating more accurate differential diagnoses (DDx).
I offer a short breakdown, along with my thoughts.
Key Findings
Performance: The LLM demonstrated a significant capability in generating differential diagnoses. When used alone, it exceeded the performance of unassisted clinicians in terms of top-10 accuracy (59.1% vs 33.6%). Additionally, clinicians assisted by the LLM showed higher DDx quality scores compared to those without its assistance.
“Top-10” accuracy measures whether the correct diagnosis is included in the list of the top 10 diagnoses suggested. I’m a little wary of this metric as the sole basis for a conclusion of diagnostic superiority, especially in veterinary medicine. I don’t often get to run ten tests on a sick patient, and frankly, I’m lucky if a client sticks with me for more than three. My diagnostic heuristics are organized around the reality of limited resources: I usually start broadly and then narrow the focus based on initial findings.
Truthfully, I don’t find myself making differential lists of ten possibilities all that often. It’s a tough case if I need to go beyond four. I’d be interested in how the LLM fared against physicians on a narrower accuracy metric. “Top-5,” maybe?
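To make the metric concrete, here is a minimal sketch (in Python, not taken from the paper) of how top-k accuracy is typically scored: each case carries a ranked differential list and a known final diagnosis, and the score is the fraction of cases whose correct diagnosis appears in the first k entries. The data and function names here are hypothetical.

```python
# Illustrative sketch (not from the paper): scoring top-k accuracy over a set
# of cases, each with a ranked differential list and a known final diagnosis.

def top_k_accuracy(cases, k):
    """Fraction of cases whose correct diagnosis appears in the top k differentials."""
    hits = sum(
        1 for case in cases
        if case["correct_diagnosis"] in case["ranked_differentials"][:k]
    )
    return hits / len(cases)

# Hypothetical example: one feline hematuria case with a four-item differential.
cases = [
    {
        "ranked_differentials": ["FLUTD", "UTI", "crystalluria", "neoplasia"],
        "correct_diagnosis": "crystalluria",
    },
]

for k in (1, 3, 5, 10):
    print(f"top-{k} accuracy: {top_k_accuracy(cases, k):.2f}")
```

Note how the score can only go up as k grows, which is exactly why a top-5 or top-1 comparison would tell us more than a top-10 one.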
In my experience (more “anecdote” than “use case”), the AI models are more helpful in producing that fifth, sixth, and seventh idea than they are at producing the top four. I think there is value in LLMs, but I’m not ready to trust them alone.
“Diagnosis” is the identification of the nature of an illness or other problem by examination of the symptoms or clinical signs. A differential list that includes the diagnosis is not the same as a diagnosis.
Improvement Over Traditional Methods: Compared to clinicians using search engines and standard medical resources, those assisted by the LLM generated more comprehensive DDx lists with higher quality and appropriateness scores.
My issue with this study is not that LLMs outperformed clinicians; it’s in what was measured and how it’s being interpreted. There’s a world of difference between a differential diagnosis and a diagnosis.
Utility in Challenging Cases: The study specifically focused on challenging real-world medical cases sourced from the New England Journal of Medicine (NEJM) case reports, demonstrating the LLM's potential to improve diagnostic reasoning and accuracy in complex scenarios.
This is the key to the value of the LLMs assisting clinicians. AI systems seem to be excellent at those rare and unusual ideas, but that’s not most of what I do. If I have a feline patient with hematuria, I’ll very rarely need a differential list longer than four things (crystalluria, UTI, FLUTD, neoplasia). Ten is, from a practical perspective, ridiculous.1
Comparison with GPT-4: The LLM's performance was also compared with GPT-4, a general-purpose large language model, indicating that the specialized LLM was more effective in clinical diagnostic scenarios.
This matters because broadly trained LLMs like GPT-4 have frequently beaten specifically trained LLMs on tasks like diagnosis.
Methodology
Study Design: 20 clinicians evaluated 302 medical cases from NEJM case reports, using either traditional resources or the LLM for assistance in generating DDx.
Performance Metrics: The study used top-10 accuracy, quality score, appropriateness, and comprehensiveness as key metrics to evaluate the performance.
With top-10 accuracy as a primary metric, I remain skeptical. It’s worth noting, however, that the authors do not oversell the model.
User Interface: A user interface was designed for clinicians to interact with the LLM, providing an environment for querying and receiving responses based on the case descriptions.
I’d like to know more about this, such as how it worked in practice and whether the LLM fabricated or hallucinated information.
Implications
Enhanced Diagnostic Accuracy: The LLM for DDx shows potential in assisting clinicians in reaching more accurate diagnoses, especially in complex cases.
IBM’s Watson was a pretty famous failure in the realm of artificial intelligence applied to medicine, but we’ve come a long way in the last ten years.
Educational Tool: Clinicians indicated the utility of the LLM for educational purposes, suggesting its potential in medical training and learning. I’d be interested in seeing more details on the nature of the cases and the LLM, but it’s hard to imagine that this doesn’t present tremendous potential for teaching students and doctors. The notion of a “clinical batting cage” like this is exciting. I want one.
Wider Application: While the study focused on NEJM case reports, the findings suggest a broader applicability of the LLM in various clinical settings. There’s more to learn, more to try, more science to be done. It certainly affirms the idea that these models can be used to augment the practice of medicine.
Limitations
Real-world Applicability: It’s a wicked world out there. The study cautions against directly extrapolating the findings to suggest the LLM's utility as a standalone diagnostic tool, given the unique nature of the NEJM case reports used. The cases were rare and unusual enough to merit publication in the New England Journal of Medicine.
Complexity of Cases: The NEJM cases represent particularly challenging scenarios and likely do not reflect the more routine cases seen in daily clinical practice.
Model Limitations: The LLM's performance in more complex cases and its integration with multimodal inputs (like images and lab results) were not fully explored.
But darn, I can’t wait to see what it can do with visual inputs like photographs and X-rays.
Conclusion
The study finishes by asserting that the LLM for DDx is a promising tool for assisting in differential diagnosis, especially in complex cases. It highlights the need for further research to understand the full potential and limitations of such models in clinical settings. It’s understated in its conclusion, calling the LLM “helpful.” That’s all, just “helpful.”
I think being an excellent diagnostician is among the most important responsibilities of someone in my line of work. I don’t have symptoms to work with, only clinical signs. A physical examination, a second-hand patient history, and then the possibility of some diagnostic testing are all I get to work with in hunting a diagnosis. And the difference between my success and failure can be the difference between an animal’s life and death.
Tools like this give us a chance to be better at this crucial skill, and I think this is one of the most valuable ways that AI can augment our care of patients. I’m not sure I’d conclude anything much more strongly than the authors did in calling their AI “helpful” in reaching a diagnosis, but I’m hopeful that when I need such help, it will be available as robustly as this.
Unless, of course, you happen to be that rare case that doesn’t align with the other thousand or so.
I agree with your critiques of this study, particularly regarding the creation of far-larger-than-typical differential lists. In a narrow test optimized for "how often did the single right diagnosis make it into the list?" the obvious way to win would be to include as many exotic and irrelevant diagnoses as possible (i.e., throwing spaghetti at the wall and seeing what sticks, to use a scientific metaphor). You could ask an LLM to come up with a HUNDRED or a THOUSAND differentials and it would happily oblige, while a person would probably struggle to come up with applicable DDx after a few dozen.
I would be far more curious to see tests like these:
1) Ask LLMs and humans to come up with a SINGLE free-text "working diagnosis" (not multiple-choice pick lists) and compare the accuracy. My hypothesis would be that humans would actually be more accurate at this, as it requires higher-order reasoning, experience, and probabilistic ranking under uncertainty.
2) Have both come up with detailed diagnostic plans for a set of signs/test results. Again, it would not be challenging for an LLM to just spit back the Quest or LabCorp test catalog, but that isn't something a focused or reasonable clinician working with limited resources would ever do.