© 2025 WUNC North Carolina Public Radio
Duke innovates on implementing and assessing AI in health spaces

A red stethoscope sits on a desk in the foreground while a doctor in a white coat types into a desktop computer behind.
Alexander Zvir / Pexels
With the introduction of AI-assisted ambient scribing tools, doctors in the Duke Primary Care network can cut down on time spent taking digital notes during and after each consultation.

As artificial intelligence enters the healthcare space, Duke University researchers are working to make sure its application is safe and fair.

Large language models (LLMs) are making administrative tasks in healthcare more efficient. LLMs are a class of AI that includes ChatGPT and Meta's Llama, among others. They respond to questions in a human-like way by predicting the most likely sequence of words based on the data they were trained on. In healthcare, they are being used to take medical notes and respond to patient portal messages.

Especially in the sensitive space that is the clinic, researchers emphasize the importance of AI governance—the practice of reviewing and assessing AI tools—to ensure that the technology is implemented in an ethical, transparent, and accountable way.

Solving 'pajama time' 

Imagine that after a long day of seeing patients, your doctor gets home, changes into their lounge clothes and plops down on the couch. Instead of turning on the TV, they open their work laptop to finish taking notes. "Pajama time," as these after-hours tasks are sometimes called, can contribute to feelings of burnout among medical practitioners.

Ambient digital scribing (ADS) tools are one example of how LLMs can increase doctors’ efficiency and decrease pajama time. They are integrated into the electronic health record interface that clinicians already use. With a patient’s consent, the doctor presses “record,” and an AI transcribes and summarizes the consultation in a medical note. Then, after the meeting, the doctor can adjust the note to ensure its accuracy.

“I didn’t realize until I started using ambient technology how much of my attention during the visit I was devoting to being a courtroom transcriptionist,” says Dr. Eric Poon, a primary care physician at the Durham Medical Center and Chief Health Information Officer at Duke Medicine.

Poon is involved in the implementation and rollout of technologies such as ADS.

“With a possible exception with telemedicine during COVID-19, I have not seen anything adopted this quickly,” reflected Poon.

He said between 60% and 70% of patient visits in the Duke primary care network are now conducted using ADS tools.

Working with ADS has helped Poon stay on schedule on days with stacked appointments and made it possible for him to give more attention to the patient sitting in front of him. Not to mention, it has restored some balance to his life.

“On clinical days, I easily get two hours back,” said Poon. “And my experience is not unique. I hear that a lot from other colleagues who have used the tool.”

Evaluation and Governance of AI

Duke University is an international leader in the field of responsible AI implementation and AI governance, according to Michael Pencina, Chief Data Scientist at Duke Health.

In 2020, as AI use became more widespread, there was a rush to bring it into healthcare. Pencina said he and his team decided that no new algorithm would be used on patients at Duke Health until it had been reviewed by their group, ABCDS, or Algorithm-Based Clinical Decision Support Oversight.

“Everybody's excited about the burden reduction,” says Pencina, “but what we would want–the kind of Holy Grail of this– is [that], kind of in the background, there is monitoring that... makes sure that these tools are doing what they are supposed to, so the clinicians can just use them without worrying about their performance.”

Chuan Hong designs such monitoring frameworks. She is an assistant professor in the Department of Biostatistics and Bioinformatics at Duke, and works with ABCDS.

“We all agree right now that LLMs need more and more ongoing review,” she said.

Yet there is a hold-up. Humans need to evaluate AI to capture clinical nuance, but that defeats the purpose of reducing human workloads. To work around this, Hong designed a framework that uses AI to help humans evaluate other AI. The frameworks were recently published with test cases on ADS and AI-automated patient messaging systems.

One study introduces the framework SCRIBE, which stands for Simulation, Computational metrics, Reviewer assessment, Intelligent evaluations for BEst practice. The goal is to scale SCRIBE to other technologies.

“We want to enlarge this and work towards automating it, so that when the next large language model…comes in, we're ready to evaluate it even faster and monitor it in a way that’s less human intensive,” Pencina said.

Bianca is a Filipina-American science reporter. She joins WUNC as a 2025 American Association for the Advancement of Science Mass Media Fellow.