Conversation Enders
On absolutism, evidence, and the work of investigating AI in veterinary medicine
In order to arrive at what you do not know
I recently had an exchange on LinkedIn with Mr. Iain Richards1, a veterinarian and senior lecturer at Lancashire University in the United Kingdom. Mr. Richards has a lengthy list of publications and, I want to mention, vastly cooler hobbies than mine.2
We disagreed about the utility of large language models in veterinary medicine. His position was stark: LLMs are “worthless,” offer “no QA,”3 provide “no evidence of genuine benefit,” and are “linked to unpleasant people.”4 The technology, in his view, offers “nothing except marketing pressure and the very mistaken belief they are of use.”
I don’t write this to relitigate the exchange or to pile on a colleague who holds a different view. I write it because the exchange helped crystallize something I’ve been thinking about for a while, and I’ll give it a name: the problem of absolutism in professional discourse.
Words like “worthless” and “nothing” are conversation-enders. They’re not invitations to discuss evidence or explore nuance; they’re stern declarations that the matter is settled. And in a profession that ostensibly advances through evidence, hypothesis, and peer review, that posture troubles me.
I asked Mr. Richards: Are you open to being persuaded?
It’s a question I try to ask myself. In fact, I rarely go into a discussion on these things without first exploring what’s true, what’s right, and what’s wrong. If someone handed me a well-designed study that contradicted my position, would I change my mind? If the answer is no, then I’m not holding a position; I’m holding a prejudice. And prejudices, however sophisticated their delivery, don’t serve our patients or our profession. While I can’t help being passionate, I can help being partisan.
And I maintain that speaking up, whether you’re right or wrong, is often an act of courage.
You must go by a way which is the way of ignorance
Mr. Richards insisted that he was open, and I started by looking for evidence that might refute the assertion that LLMs are worthless. Some of it was the same evidence that had persuaded me; some of it I discovered in my search. It hasn’t been my experience that these things are worthless, but that’s just one man’s opinion: nothing more than compiled anecdotes. But research and such?
I turned to the honest-to-goodness experts. Anthropic, the company behind Claude, published internal data this past December examining how their own engineers use AI in their work: 132 engineers surveyed, 53 in-depth interviews, and more than 200,000 Claude Code transcripts analyzed. The findings: a 50% self-reported productivity gain, up from 20% just one year prior. More compellingly, the objective measures tracked a 67% increase in merged pull requests5 per engineer per day.
Twenty-seven percent of Claude-assisted work, the study found, simply wouldn’t have been done otherwise. Further, engineers reported becoming genuinely “full-stack,” that is to say working competently outside their core expertise.
And here’s a nuance that matters: engineers reported they can only “fully delegate” between 0 and 20 percent of their work. The rest requires active supervision and validation. Human-in-the-loop isn’t the exception for LLM coding use; it’s the norm.
The study acknowledges the paradox of supervision (you need skills to verify skills) and some engineers’ concerns about skill atrophy. These are legitimate concerns, and we will likely need both old mechanisms and new ones to maintain and advance those skills.
This is honest research, findings presented alongside limitations. But Anthropic builds AI. Of course their engineers find it useful, the skeptic might say. They are, after all, motivated to look on the bright side of their own product. While obviously and wildly competent, they are not disinterested parties. Fair enough.
Let’s look at something closer to home.

In order to possess what you do not possess
This month, the Journal of Veterinary Internal Medicine published a study that ought to get the attention of every veterinarian who’s ever struggled to explain immune-mediated hemolytic anemia to a confused pet owner or watched comprehension fade from a client’s eyes during a diabetes consultation. Which, I suspect, is just about every veterinarian who’s had a patient with IMHA or diabetes.
The study, led by researchers including Dr. Katie McCool, an internist and associate professor6 at NC State, evaluated ChatGPT-generated client handouts for three common but complex conditions: canine diabetes mellitus, canine immune-mediated hemolytic anemia, and feline inflammatory bowel disease. The methodology was rigorous: 50 pet owners evaluated the handouts for comprehension and satisfaction; 67 ACVIM-boarded small animal internal medicine diplomates assessed accuracy and clinical utility.
Pet owners demonstrated statistically significant improvements in disease understanding across all three conditions. Effect sizes ranged from 0.78 to 0.84, “large” by any reasonable standard. The effect size tells us not just whether something worked, but how much it worked. Values at or above 0.8 are considered “large” by statistical convention, and that translates to a meaningful improvement in client comprehension. It’s the kind of upgrade you’ll feel in practice.
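For the statistically curious: the most common standardized effect size for a difference between two means is Cohen’s d. The paper’s exact statistic isn’t reproduced here, so treat what follows as the textbook definition behind the 0.8-is-large convention rather than the study’s own formula:

d = \frac{\bar{x}_1 - \bar{x}_2}{s_\mathrm{pooled}}, \qquad s_\mathrm{pooled} = \sqrt{\frac{(n_1 - 1)\,s_1^2 + (n_2 - 1)\,s_2^2}{n_1 + n_2 - 2}}

In plain terms, a d of roughly 0.8 means the average post-handout comprehension score sat about four-fifths of a pooled standard deviation above the average pre-handout score, a gap large enough to notice without squinting at a p-value.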
Median satisfaction scores of 4 to 5 out of 5. And here’s what should matter to everyone who believes veterinary medicine ought to serve all clients: there was no association between education level and ease of understanding. The handouts worked whether you had a graduate degree or a high school diploma.
The diplomates, board-certified specialists, rated the materials with median accuracy scores of 4 out of 5. Seventy-one percent would use the diabetes handout with only minor or no revisions. Seventy-six percent for IBD. Even IMHA, one of the most challenging conditions to explain, earned approval from 50% of respondents with minor or no changes required.7
The authors noted that, even when prompted to write at a sixth to eighth grade reading level, the LLM didn't always comply. The readability scores came back higher than intended. But here's the thing: the handouts still worked across all education levels. Which makes me wonder if readability formulas, though often useful, don't capture everything that matters about whether someone actually understands what they're reading. The evidence was in the comprehension rather than the algorithm.
This is peer-reviewed evidence, published by Oxford University Press on behalf of the American College of Veterinary Internal Medicine. It empirically demonstrates measurable improvements in client comprehension, validated accuracy from specialists, and practical utility that could save veterinary teams hours while improving patient outcomes. I admire the researchers’ commitment.
You must go by the way of dispossession
Dr. McCool and her colleagues at NC State did something genuinely innovative in academic veterinary medicine8: they investigated a new technology, designed a rigorous study, and published findings that contribute real knowledge to our profession and the practice of medicine. This is how science is supposed to work: with hypotheses, methodology, data, and peer review. And my problem with Mr. Richards’ statements is that, given his role as faculty at a veterinary college, he is in a position to prevent genuinely valuable research from being performed at all.
If Dr. McCool had found herself a student or resident of Mr. Richards, would she have been discouraged from pursuing research like this, or even allowed to pursue it at all?
I’m something short of naive about AI. I’ve written extensively about its limitations, the importance of human oversight, and the legitimate concerns about skill atrophy and over-reliance. I’ve criticized papers that demonstrate limited understanding of the technology as well as the ones that oversell it. I’ve called out bad AI writing as bad writing.
And the McCool study represents exactly the kind of work our profession needs: careful investigation of emerging tools, honest assessment of both benefits and limitations, and findings that can actually inform clinical practice. The finding that 71-76% of board-certified internists would use these AI-generated handouts in practice isn’t a marketing blog post or a startup’s pitch deck. It’s data.
And data brings us back to the problem of absolutism.
When we say a technology is “worthless,” we foreclose the possibility that studies like McCool’s might teach us something, and sometimes the possibility that such studies are performed at all. When we declare there’s “no evidence of genuine benefit,” we reveal that we haven’t looked, or that we’ve decided in advance what we’ll find.
I understand the impulse, I really do. New technologies arrive with breathless hype and marketing pressure, an endless string of social media posts and marketing emails. I just attended the VMX conference, and if I hear the word “innovative” one more time I’m going to hurl a thesaurus at a marketing team9. Skepticism is healthy and necessary, but skepticism means demanding evidence, not refusing to consider it.
There’s a particular discomfort that settles in my chest when I consider avoiding something altogether. It’s an acute feeling of cowardice, the sort that occasionally seeks to justify itself through sophisticated or moralistic excuses about complexity or concern or whatever. What I’ve come to rely upon is that courage isn’t found in grand declarations or in absolute comprehension; rather, it lives in the simple, often frustrating, process of learning. Step by faltering step.
And what you do not know is the only thing you know
As I write this, I’m watching my almost-two-year-old son careen around the room. And it occurs to me that the usual definition of “baby steps” is an inaccurate one. Baby steps are not “tentative acts or measures which are the first stage of a long or challenging process.” Baby steps are wholehearted and reckless lurches into the unknown, with utter abandon and maximum intensity, with many a misstep and fall along the way.10 They’re not just chaotic, they’re committed. I like to think that science often benefits from such an attitude.
When I find researchers like Katie McCool doing the hard work of actually investigating these questions, when I see peer-reviewed evidence where before there were only marketing claims and dismissive hand-waves, I think we ought to say so. And cheer them on for it.
This is the work that advances our profession. Not absolutist declarations. Not categorical dismissals. The careful, honest, rigorous investigation of tools that might help us serve our clients, patients, and profession better.
Courage, commitment, and competence aren’t found in knowing everything immediately, and certainly not in announcing it as such. Far more often, courage, commitment, and competence are realized in refusing to be defined by what we don’t yet understand.
In UK academic convention, "Dr." typically denotes a doctoral degree. Mr. Richards holds a BVSc (Bachelor of Veterinary Science) and MVetSci (Master of Veterinary Science). In the United States, veterinarians are commonly addressed as "Doctor" regardless of degree nomenclature. If he were practicing in the United States, he would be addressed as “Dr. Richards,” but I use the convention of his country rather than mine.
His professional bio includes a note that he enjoys driving a fully restored 1972 MGB, which is objectively cool, and “fellwalking,” which sounds more or less like a cooler version of hiking.
A demonstrably false claim, as extensive training and red-teaming in the development of large language models are well-documented.
What isn’t?
A “merged pull request” is when an engineer submits their code changes for review by colleagues, the changes are approved, and they are then incorporated into the actual product. “Merged pull requests per engineer per day” is effectively a measure of completed, quality-checked work.
This piece previously referred to Dr. McCool as an internal medicine resident, based on NC State’s website, which has since been updated. Dr. McCool is board-certified, a diplomate of the American College of Veterinary Internal Medicine, and was at the time this piece was originally published. I have updated the piece accordingly.
The authors noted that the LLM would not always follow the instruction to produce a document at a sixth to eighth grade reading level. As a guy who just reviewed a lot of seventh grade math and science with his stepson, I don’t know that it’s realistic to expect that things like diabetes mellitus and immune-mediated hemolytic anemia could be reduced to that level of understanding.
Worth noting that the work on this began in 2023, using GPT-3.5, so while the study was only just published, the work began when few were using LLMs this way and fewer still were doing real research on it.
More likely I will send a link to a marketing team via email. I actually lost my thesaurus recently, and I was depressed, heartbroken, melancholy, mournful, pessimistic, somber, and… I forget what else.
Not yours? Okay, well, mine does.

