Can You Trust an AI Medical Scribe? What 5 Peer-Reviewed Studies Found

Nahom Nigussie
Tags: AI medical scribe, medical documentation, AI hallucinations, evidence-based medicine, clinical AI, HIPAA compliant AI scribe, AI scribe accuracy

Disclosure: Nahom Nigussie is the founder of Aclera AI. This analysis reflects his perspective on the AI scribe market, informed by building a citation-first documentation platform.

The note looks flawless. A complete SOAP assessment, appropriate ICD-10 codes, a treatment plan that reads like you wrote it yourself. The AI scribe captured everything from your fifteen-minute encounter, formatted it perfectly, and dropped it into your EHR before you finished washing your hands.

But here’s the question no one at the vendor demo answered: How would you know if it fabricated a drug interaction? Misattributed a symptom to the wrong system? Invented a family history detail that sounds plausible but never came up in conversation?

You wouldn’t. Not without listening to the recording yourself, which defeats the purpose of having an AI scribe in the first place.

This is the central tension physicians face with AI documentation tools: systems that promise to save hours of charting time while asking you to trust outputs you cannot verify. For a profession built on evidence-based medicine, that’s a significant ask.

The Black Box Problem

Most AI medical scribes operate on a simple pipeline. Ambient microphones capture the patient encounter. The audio feeds into a large language model. The model produces a clinical note. You review it, sign it, and move on.
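In rough pseudocode, that pipeline looks something like the sketch below. This is a minimal, hypothetical illustration: the function names, stubs, and prompt are invented for clarity and do not describe any particular vendor's implementation.

```python
# Minimal sketch of a typical ambient AI scribe pipeline (illustrative only;
# these function names and stubs are hypothetical, not any vendor's real API).

def transcribe(encounter_audio: bytes) -> str:
    """Speech-to-text step, stubbed here."""
    return "Patient reports productive cough and fever for three days..."

def llm_complete(prompt: str) -> str:
    """Single large-language-model pass, stubbed here."""
    return "S: ...\nO: ...\nA: ...\nP: ..."

def generate_clinical_note(encounter_audio: bytes) -> str:
    transcript = transcribe(encounter_audio)                  # audio in
    return llm_complete(
        f"Write a SOAP note for this encounter:\n{transcript}"
    )                                                         # free-text note out

note = generate_clinical_note(b"")
# The return value is the note and nothing else: no reasoning trace,
# no confidence estimate, no citation for any clinical claim it contains.
```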

What happens inside that model stays inside that model.

The AI doesn’t show you why it concluded the patient’s chest pain is musculoskeletal rather than cardiac. It doesn’t cite the clinical criteria it used to rule out red flags. It doesn’t link its medication recommendations to guidelines or flag potential interactions with documented allergies. It just… outputs a note.

This opacity creates a fundamental conflict with how physicians are trained to practice. Evidence-based medicine, as Sackett and colleagues defined it in their foundational 1996 BMJ paper, requires “the conscientious, explicit and judicious use of current best evidence in making decisions about the care of individual patients.” The key word is explicit. EBM demands showing your work.

AI scribes ask physicians to do the opposite: sign documentation generated by a system that cannot explain its reasoning.

70% of AI Scribe Notes Contain Errors

The research on AI documentation accuracy paints a more complicated picture than vendor marketing suggests.

A 2025 study in the Journal of Medical Internet Research tested two commercial AI scribes across 44 draft notes from simulated encounters. The sample was small but revealing: 2.9 errors per note on average, with 70% of notes containing at least one error (Biro et al., J Med Internet Res 2025;27:e64993). While the sample size limits generalizability, the error patterns align with larger studies. Omission errors (details from the encounter that the AI simply failed to capture) comprised 54-83% of mistakes. These are the most dangerous kind because catching them requires remembering what was said, not just reading what was written.

A separate study in Frontiers in Artificial Intelligence compared AI-generated notes directly to physician-authored documentation. The result: 31% of AI notes contained hallucinations versus 20% in physician notes (p=0.01) (Palm et al., Front Artif Intell 2025).

The most rigorous assessment comes from Asgari and colleagues, published in npj Digital Medicine. Their team analyzed 12,999 clinician-annotated sentences across 450 clinical notes and found an overall hallucination rate of 1.47%. That sounds reassuring until you read the next line: 44% of those hallucinations were classified as major, meaning errors that could impact patient diagnosis or management if uncorrected (Asgari et al., npj Digit Med 2025;8:274).
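A rough back-of-the-envelope calculation using the study's own figures shows why that caveat matters (assuming, for simplicity, that major hallucinations are spread evenly across notes):

```python
# Back-of-the-envelope arithmetic from the figures reported by Asgari et al.
# (npj Digit Med 2025), assuming major hallucinations are spread evenly across notes.
sentences = 12_999            # clinician-annotated sentences
notes = 450                   # clinical notes reviewed
hallucination_rate = 0.0147   # 1.47% of sentences
major_fraction = 0.44         # 44% of hallucinations classified as major

hallucinated = sentences * hallucination_rate    # ~191 sentences
major = hallucinated * major_fraction            # ~84 sentences
print(f"~{hallucinated:.0f} hallucinated sentences, ~{major:.0f} of them major")
print(f"~ one major hallucination per {notes / major:.1f} notes")
```

A 1.47% rate stops sounding small when it works out to roughly one potentially diagnosis- or management-altering error for every five or six notes.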

The error taxonomy is revealing. Fabricated content accounts for 43% of hallucinations: the AI invented information that was never part of the encounter. Negations comprise 30%: the AI reversed the meaning, stating that a patient denied a symptom they actually reported, or vice versa. Contextual errors account for 17%, and causality errors for 10%.

The plan section of clinical notes showed the highest concentration of major hallucinations at 21%. The part of the note that tells you what to do next is the part most likely to be dangerously wrong. This is precisely why Aclera focuses on Assessment & Plan generation. If the research shows A&P is where AI documentation fails most dangerously, that’s where verification infrastructure matters most.

OpenAI’s own HealthBench Hard evaluation reported a 1.6% hallucination rate for GPT-5 on its most challenging medical scenarios, roughly an eightfold improvement over earlier models, but not zero. That benchmark used 5,000 multi-turn conversations with 48,562 unique rubric criteria validated by 262 physicians across 26 specialties (arXiv:2505.08775, May 2025). Even the best models, evaluated on the most rigorous benchmarks, make mistakes.

Perfect Diagnosis, Deadly Treatment

Benchmark performance creates a false sense of security.

Frontier language models now score above 95% on medical licensing exams. GPT-5 achieved 95.84% on MedQA and 95.22% across USMLE Steps 1-3, exceeding human pre-licensed expert performance by 24-29 percentage points (Wang et al., arXiv:2508.08224, August 2025). OpenAI’s o1 leads at 96.5% (Vals AI Leaderboard, January 2026). A meta-analysis of 83 studies in npj Digital Medicine found no statistically significant difference between generative AI and physician diagnostic accuracy overall (p=0.10) (Takita et al., npj Digit Med 2025;8:175).

These numbers suggest AI has reached physician-level competence. The reality is more nuanced.

A December 2025 medRxiv preprint (not yet peer-reviewed, small sample of 15 vignettes) tested five frontier models including GPT-5.2, Claude 4.5 Opus, and Gemini 3 Pro. All five achieved 100% diagnostic accuracy (Risheq AN, medRxiv 2025). Every model correctly identified bacterial meningitis in a patient presenting with classic symptoms.

Then came Case 15: meningitis in a patient with documented severe IgE-mediated penicillin anaphylaxis.

Two of the five models, Gemini 3 Pro and Claude 4.5 Opus, recommended carbapenems. For a patient who would die from that drug class.

The study’s conclusion cuts to the core issue: “Diagnostic perfection does not equal clinical safety. A model can be 100% correct in diagnosing a patient and yet fail 100% in keeping them safe if it does not rigidly respect contraindications.”

GPT-5.2 and Kimi K2 passed the safety test by treating the contraindication as a rigid rule. The difference came down to what researchers call “safety alignment”: whether models treat absolute contraindications as negotiable clinical judgments or hard stops.

This is why Aclera runs on GPT-5.2 via Azure OpenAI. In head-to-head testing, it treated “patient will die from this drug class” as a non-negotiable constraint rather than a factor to weigh.

Benchmark scores measure what AI knows. They don’t measure whether AI will kill your patient.

The Automation Bias Trap

Even when physicians review AI outputs, the evidence suggests they don’t catch errors as often as they think.

The definitive study on automation bias in clinical AI comes from Dratsch and colleagues, published in Radiology in 2023. Twenty-seven radiologists (11 inexperienced, 11 moderately experienced, 5 very experienced) evaluated mammograms with AI assistance. When the AI provided correct suggestions, accuracy ranged from 79-82% across experience levels.

When the AI deliberately provided incorrect suggestions, accuracy collapsed.

Inexperienced radiologists dropped from 79.7% to 19.8%, a 60 percentage point decline. Moderately experienced radiologists fell from 81.3% to 24.8%. Even highly experienced radiologists saw accuracy drop from 82.3% to 45.5% (Dratsch et al., Radiology 2023;307(4):e222176). All comparisons reached statistical significance (p<.001 to p=.003). The effect sizes were massive (r=0.93-0.97).

A JAMIA systematic review of 74 studies found that erroneous clinical decision support increased incorrect decisions by 26% overall (Risk Ratio 1.26, 95% CI 1.11-1.44, p<0.0005) (Goddard et al., J Am Med Inform Assoc 2012;19(1):121-127).

The implication for AI scribes is uncomfortable. Physicians are being asked to verify documentation produced by systems that look authoritative, sound confident, and are wrong often enough to matter. The research suggests we’re not as good at catching those errors as we assume.

The Documentation Burden Is Real

The solution isn’t rejecting AI documentation. The pressure physicians face is genuine.

Sinsky and colleagues observed 57 physicians across four specialties for 430 hours. They found physicians spend 27% of their office day on direct patient contact versus 49% on EHR and desk work, nearly a 2:1 ratio of documentation to clinical care (Sinsky et al., Ann Intern Med 2016;165(11):753-760). Physicians reported an additional 1-2 hours of “pajama time” nightly finishing charts.

The Shanafelt longitudinal series in Mayo Clinic Proceedings documents physician burnout at 45.2% (Shanafelt et al., Mayo Clin Proc 2025;100(7)). The Medscape 2024 Physician Burnout Report confirmed 49% burnout, with 62% citing bureaucratic tasks as the primary cause. Emergency medicine leads specialty burnout at 63%.

Canadian physicians lose an estimated 18.5 million hours annually to administrative burden, equivalent to 55.6 million patient visits (Canadian Federation of Independent Business, January 2023).

AI scribes address a genuine problem. The question is whether they can do so without creating new ones.

The first randomized controlled trial comparing AI scribe products directly was published in NEJM AI in November 2025. Lukac and colleagues randomized 238 outpatient physicians to Microsoft DAX Copilot, Nabla, or usual care across approximately 72,000 patient encounters. Nabla reduced time-in-note by 41 seconds, a 9.5% decrease. DAX Copilot showed no significant time savings versus control. Both products were used in only about 30% of visits. Clinicians reported inaccuracies “occasionally” on a 5-point scale, with omissions and pronoun errors most common (Lukac et al., NEJM AI 2025;2(12)).

Modest benefits. Persistent accuracy concerns. Limited adoption even when tools are provided.

Toward AI That Shows Its Work

The answer lies in transparency.

The FDA’s June 2024 guidance on transparency for machine learning-enabled medical devices, published jointly with Health Canada and the UK MHRA, establishes a framework that applies here. The “What” principle specifically requires sharing “the logic of the model when available” and “limitations including biases, confidence intervals, and data characterization gaps.”

A JMIR AI systematic review found that explainable AI increased clinician trust in 50% of studies, had no effect in 30%, and could either increase or decrease trust in 20% depending on explanation quality (Rosenbacke et al., JMIR AI 2024;3:e53207). The critical finding: explanations only enhance trust when they are clear, concise, and clinically relevant. Complexity backfires.

What does clinically relevant transparency look like for documentation AI?

This is why we built Aclera differently.

Aclera surfaces two layers of transparency most AI scribes hide. First, the reasoning process: you see the logical chain the AI used to reach its clinical conclusions. Not just the finished note, but how it got there. “Patient presents with productive cough and fever. Per ATS/IDSA CAP guidelines, this presentation warrants empiric coverage. Given documented penicillin allergy, recommending respiratory fluoroquinolone per IDSA alternative pathway.” You see the clinical logic, step by step.

Second, inline citations: every substantive claim links to PubMed, IDSA, AAFP, or ATS sources you can verify yourself. Click through. Read the guideline. Disagree if the reasoning doesn’t hold.
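As a concrete illustration of what a citation-backed claim could look like under the hood, here is a hypothetical sketch. The data model, field names, example text, and placeholder URL are invented for this post; they are not Aclera's actual schema or API.

```python
from dataclasses import dataclass, field

# Hypothetical data model for one claim in an Assessment & Plan section.
# Field names, example content, and the URL are placeholders, not Aclera's schema.

@dataclass
class Citation:
    source: str    # e.g., a named guideline or PubMed entry
    url: str       # a link the physician can open and read

@dataclass
class Claim:
    text: str               # the sentence that appears in the note
    reasoning: list[str]    # the step-by-step clinical logic behind it
    citations: list[Citation] = field(default_factory=list)

plan_item = Claim(
    text="Start a respiratory fluoroquinolone for community-acquired pneumonia.",
    reasoning=[
        "Productive cough and fever documented in the encounter.",
        "Presentation warrants empiric CAP coverage per guideline criteria.",
        "Documented penicillin allergy rules out first-line beta-lactam therapy.",
    ],
    citations=[
        Citation(source="IDSA/ATS CAP guideline",
                 url="https://example.org/placeholder-guideline-link"),
    ],
)
```

Whatever the underlying representation, the property that matters is that every sentence a physician signs carries both its reasoning steps and a source that can be opened and checked.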

This isn’t a claim to explain neural network weights. No AI system can do that, and no vendor who claims otherwise is being honest. It’s something more practical: visibility into clinical reasoning, backed by sources you can check. The AI shows its differential. You verify against the evidence. You sign what you agree with.

When Aclera generates an assessment, the reasoning appears alongside the citation. The point isn’t that AI is always right. The point is that you can always check, both the logic and the source. See how it works →

Over 40 federal courts now require disclosure when AI is used in legal filings, after 716+ documented cases of lawyers submitting fabricated AI-generated citations. In medicine, the stakes are higher. A hallucinated case citation embarrasses a lawyer. A hallucinated drug interaction harms a patient. In aviation, the European Union Aviation Safety Agency’s AI Roadmap 2.0 already requires AI systems to be “explainable, predictable, traceable.”

Medicine should demand the same standard.

The Question Worth Asking

The technology will improve. Error rates will decline. Models will get better at respecting contraindications and capturing the details that currently slip through.

But the fundamental question physicians should ask vendors won’t change: Can I see why it wrote this?

A tool that saves time but can’t answer that question asks you to practice medicine on faith. A tool that shows its reasoning (the logical steps, the clinical criteria applied, the guidelines referenced) asks you to practice medicine the way you were trained: with explicit justification you can evaluate and challenge.

For physicians evaluating AI documentation tools, that distinction matters more than benchmark scores, time savings, or feature lists. The right question isn’t “Is this AI accurate?”

It’s “Can I verify it myself?”


See What Verifiable AI Documentation Looks Like

Aclera shows its work: both the reasoning and the sources. Every AI-generated claim displays the logical chain that produced it, linked to PubMed, IDSA, AAFP, or ATS sources you can verify yourself.

Try Aclera Free →

No credit card required. 14-day trial.