
I agree with your critiques of this study, particularly the point about generating far larger than typical differential lists. In a narrow test scored on "how often did the single right diagnosis make it into the list?", the obvious way to win is to include as many exotic and irrelevant diagnoses as possible (i.e., throwing spaghetti at the wall and seeing what sticks, to use a scientific metaphor). You could ask an LLM to come up with a HUNDRED or a THOUSAND differentials and it would happily oblige, while a person would probably strain to think of applicable DDx beyond a few dozen.

I would be far more curious to see tests like these:

1) Ask LLMs and humans to come up with a SINGLE free-text "working diagnosis" (not a multiple-choice pick list) and compare accuracy. My hypothesis is that humans would actually be more accurate at this, since it requires higher-order reasoning, experience, and probabilistic ranking under uncertainty.

2) Have both produce detailed diagnostic plans for a given set of signs/test results. Again, it would not be challenging for an LLM to just spit back the Quest or LabCorp test catalog, but that isn't something a focused, reasonable clinician working with limited resources would ever do.
