GPT Goes to Vet School
A recent study evaluates how well ChatGPT answers vet test questions.
A recent study from veterinarians at the University of Georgia evaluating OpenAI’s GPT-3.5 and GPT-4 on veterinary test questions marks a worthwhile step toward integrating artificial intelligence, specifically large language models, into veterinary research. Their initiative reflects a growing curiosity and readiness to embrace AI advancements in our field, although I can’t help but notice errors in the text that suggest an incomplete understanding of the technology.
ChatGPT is not an information-processing tool, but rather a language-processing tool. It mimics and generates text from its pre-trained data. Asking it to process information is like testing how well your car functions as a boat.1 The car will do poorly, but it was probably not designed to float.
Still, the study by Drs. Coleman and Moore opens the door to valuable discussion, and its approach invites further dialogue. Recognizing the complexity of designing scientific studies, especially those around evolving technologies like AI, I see an opportunity for broader context and deeper understanding in future research.
It's heartening to see such studies emerge, although I’m a bit ambitious about these things, so I’m always hoping for a broader scope and richer context that could enhance our understanding. But in a variety of ways, the study’s methodology and analysis indicate a limited familiarity with how AI functions. That gap points to a growing need for interdisciplinary collaboration in research efforts and for a more robust peer review process that includes experts in the technology being tested.
The title and focus of the study appear to set a specific narrative: AI’s capabilities compared unfavorably to those of veterinary students. While it’s natural for students, having undergone specific training, to perform well, the AI models, GPT-4 in particular, demonstrated real proficiency without any preparation specific to the test material. That outcome, especially GPT-4’s 77% score, is not only impressive but indicative of AI’s potential as a supplementary tool in education.
The students were presumably given specific source material and time to learn it. The AI models were not “pre-trained” on that specific information; the researchers relied instead on the models’ pre-training from internet sources, and it’s possible that the subject matter of the test questions is not widely available on the internet. If the AI had been provided with the source material as the students had been, I suspect we would’ve seen different results. If the students had only been allowed to study the internet before the examination, we can safely assume their performance would have suffered. The discussion might have benefited from a comparison under similar conditions, fostering a more nuanced understanding of AI’s role and potential in veterinary education.
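To make that concrete, here is a minimal sketch of how a model could be handed the same source material the students studied. The file name and sample question are hypothetical placeholders of my own; the study did nothing like this.

```python
# A minimal sketch of prompting with the students' source material included.
# "course_notes.txt" and the sample question are hypothetical placeholders.
with open("course_notes.txt") as f:
    source_material = f.read()

prompt = (
    "Using only the course material below, answer the question.\n\n"
    "--- COURSE MATERIAL ---\n"
    f"{source_material}\n\n"
    "--- QUESTION ---\n"
    "Which of the following is a core vaccine for dogs?\n"
    "A. Bordetella  B. Rabies  C. Leptospira  D. Lyme"
)
```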
The critique also extends to certain methodological choices, such as evaluating AI through a format less suited to its strengths. As Dr. Ethan Mollick’s recent LinkedIn post indicates, the AI will usually do its best approximation of what’s asked of it, so an evaluation limited in scope will often “prove” the researchers’ point. Asking a single question at a time would be much better suited to the AI’s strengths; it would be considered “best practices” for AI use and would’ve yielded results more representative of the AI models’ capabilities. And students, after all, only answer one question at a time. Because of the testing methodology in this study, we learn at least as much about the AI models’ ability to answer long lists of multiple-choice questions in a single prompt as we do about their ability to answer veterinary test questions.
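For illustration, here is a minimal sketch of one-question-at-a-time prompting using the OpenAI Python SDK. The model name, instructions, and sample question are my assumptions, not the study’s protocol.

```python
# A minimal sketch of one-question-at-a-time evaluation, assuming the
# openai Python SDK; the model name and sample question are placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical question set; the study's actual items aren't reproduced here.
questions = [
    {
        "stem": "Which vein is most commonly used for routine canine venipuncture?",
        "choices": ["A. Cephalic vein", "B. Jugular vein",
                    "C. Lateral saphenous vein", "D. Femoral vein"],
    },
]

for q in questions:
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,  # reduce run-to-run variation across repeated trials
        messages=[
            {"role": "system",
             "content": ("You are taking a veterinary examination. "
                         "Answer the multiple-choice question with the "
                         "letter of the best answer.")},
            {"role": "user",
             "content": q["stem"] + "\n" + "\n".join(q["choices"])},
        ],
    )
    print(response.choices[0].message.content)
```

One question per request also lets each item be retried independently, which matters when measuring consistency across repeated trials.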
Furthermore, the study's reflections on AI's limitations, while valid, should also underscore the technology's evolving nature. The updates and enhancements in AI models, including the capability to process and interpret images, highlight the dynamic progress in the field. GPT-3.5 was released in March of 2022; GPT-4 was released in March 2023. In a year, the model improved by almost 50%, despite not being used in its most effective fashion.
I know I often sound full of breathless wonder at the technology, but I’m fairly impatient with research that does not measure up when it comes to measuring up. Acknowledging these advancements can enrich our perspective on how AI tools might complement traditional veterinary education and practice, but the paper seems to ignore that such advancements exist.
The paper also contains a number of other factual errors, and I have detailed some below.2
This dialogue is crucial, not only for critiquing but for building upon the foundational work laid by researchers like Drs. Coleman and Moore. Their willingness to explore the intersection of AI and veterinary science paves the way for future inquiries that might more fully harness AI's capabilities, guided by a thorough understanding and thoughtful application. We don’t often start by being good at something. Missteps are necessary on the path to expertise.
I encourage my colleagues to engage with AI research actively and constructively. Through collaborative exploration and critical yet open-minded engagement, we can ensure that our forays into AI not only advance our scientific understanding but also uphold the values and commitments central to veterinary medicine. We have the opportunity to shape the future, and to do so in a way that hampers, stifles, or invalidates would be a discredit to our profession.
The initiative taken by Drs. Coleman and Moore is a step toward this goal, albeit a stumbly one. It beckons the veterinary community to participate in shaping a future where AI and veterinary practice enhance each other in pursuit of improved animal care and health outcomes. I hope that future studies will reflect the researchers’ advanced understanding of the technology they set out to evaluate.
1. Your car is not a boat. Please do not test it.
2. Researchers stated, “Another limitation of the current study was that assessment of ChatGPT’s understanding of the material relied on the formulation of answers to multiple choice and true/false questions in a list format, without explanations. As a result, this limited our ability to better appreciate the knowledge base and clinical reasoning of the platform.”
The researchers specifically instructed the AI models to abstain from explanations.
The researchers stated, “A second limitation is that ChatGPT was created from data produced before September 2021.”
The researchers state that “ChatGPT is constantly updated by OpenAI.”
The researchers state, “While these tools can offer valuable insight or information, they lack nuanced understanding, clinical experience, and ethical considerations essential in clinical decision-making. Reliance on AI-generated content carries risk of misinterpretation of information, incorrect diagnoses, and inappropriate treatment planning. As such, it is critically important for students to assess AI-generated information, cross-reference it with reputable sources, consult experienced professionals, and integrate their own expertise prior to making medical decisions.”
Answering a question on a test is not a medical decision.
The researchers state, “In the current study, questions that included images were omitted because neither platform has the ability to recognize and interpret images.”
The researchers state, “As has been identified by others, AI occasionally provides spurious results. This poorly understood phenomenon, coined artificial hallucination, occurs when AI provides confident answers that cannot be explained by the training data alone. Irrespective of the overall performance, the consistency of results is also important for educators and students to recognize. A key finding of the current study was that ChatGPT-3.5 was correct on all trials for only 33% of questions and ChatGPT-4.0 was correct on all trials for only 69% of questions. Identifying why ChatGPT makes these mistakes may help guide students in interpretation of spurious results.”
The spurious results can be partly explained by the testing methodology, particularly the multiple-choice questions and the prompting instructions.
LLMs are language-processing tools, not information-processing tools.
The researchers state, “Nonetheless, the findings of the current study provided, to the authors’ knowledge, the first evaluation of the performance of large language models in veterinary education and should caution veterinary students about concluding that AI-generated information is uniformly accurate.”
I would offer the same caution about veterinarians being uniformly accurate.