Can you trust the codes an AI proposes? The honest answer
A clinician's guide to how an AI coding assistant fails safe: codes grounded in your notes, uncertainty flagged not faked, and your sign-off as the last word.
A compliance lead watches a vendor demo and hits the same silent question every time: what happens on the visit that does not fit the pattern? The clean cases are easy. The hard ones are the ambiguous note, the comorbidity that almost qualifies but doesn’t, the procedure documented in shorthand. That is where a coding tool either earns trust or quietly creates a problem you discover at audit.
So the fear is reasonable, and it deserves a straight answer rather than a reassuring one. Can you trust the codes an AI proposes? Only if you understand exactly how it behaves when it is unsure. Here is how ours is built to behave, and where a human still has to stand.
A code with no evidence is not a code
The single most important property of a trustworthy coding assistant is that it does not pull codes out of the air. Our coding assistant proposes codes directly from the documented encounter, and every proposed code is tied to the specific passage of documentation that supports it. The diagnosis code points to the line in the note where that diagnosis was assessed. The level of service points to the work that was actually recorded.
This matters for a practical reason. When a code carries its supporting evidence with it, a reviewer is not auditing a guess; they are checking a citation. You can see why the code was proposed in the same glance it takes to accept or override it. A code without a clear basis in the record is not a shortcut to fix later; it is exactly the thing the system is built to avoid producing.
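To make "a code that carries its citation" concrete, here is a minimal sketch of what such a record might look like. It is illustrative only and not our implementation: the names `Evidence`, `ProposedCode`, `cited_text`, and `grounded_only` are hypothetical, and character offsets are just one assumed way to point at a passage.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Evidence:
    """Hypothetical pointer to the passage of the note that supports a code."""
    start: int  # character offset where the supporting passage begins
    end: int    # character offset where it ends (exclusive)


@dataclass(frozen=True)
class ProposedCode:
    code: str
    evidence: Optional[Evidence]  # None means the proposal cites nothing


def cited_text(note: str, proposal: ProposedCode) -> Optional[str]:
    """Return the passage a proposal cites, or None if the citation is missing or out of range."""
    ev = proposal.evidence
    if ev is None or not (0 <= ev.start < ev.end <= len(note)):
        return None
    return note[ev.start:ev.end]


def grounded_only(note: str, proposals: List[ProposedCode]) -> List[ProposedCode]:
    """Keep only proposals whose citation resolves to real text in the note."""
    return [p for p in proposals if cited_text(note, p)]


if __name__ == "__main__":
    note = "Assessment: type 2 diabetes mellitus without complications. Plan: continue metformin."
    phrase = "type 2 diabetes mellitus without complications"
    start = note.find(phrase)
    cited = ProposedCode("E11.9", Evidence(start, start + len(phrase)))
    uncited = ProposedCode("I10", None)
    for p in grounded_only(note, [cited, uncited]):
        print(p.code, "<-", cited_text(note, p))
```

In this sketch a proposal with no resolvable citation never reaches the reviewer as a code; the reviewer only ever sees a code next to the text it points to.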
“I’m not sure” is a feature, not a failure
The word people reach for is “hallucination”: a model stating something confidently that has no basis in the source. In coding, that would look like a plausible code the documentation does not actually support. It is the right thing to be afraid of.
The defense is not cleverness. It is restraint. The system is built to flag uncertainty or abstain rather than fabricate a code it cannot ground in the note. If the documentation does not clearly support a given code, the honest output is to say so and route it to a human, not to produce a tidy answer that papers over the gap. An assistant that occasionally says “this one needs your eyes” is far more trustworthy than one that is never visibly unsure, because the second kind is unsure just as often; it simply doesn’t tell you.
That is the trade we make deliberately. We would rather surface a question than manufacture a false certainty, because a flagged ambiguity costs you thirty seconds and a fabricated code can cost you a denial or a compliance finding.
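As a rough illustration of "flag or abstain rather than fabricate", the sketch below routes a candidate code either to a proposal or to a human-review flag. Everything in it is an assumption made for the example: the `Candidate` shape, the `support_score` field, and the `0.8` threshold are hypothetical and do not describe how our system measures uncertainty.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Tuple


class Outcome(Enum):
    PROPOSE = "propose"
    FLAG_FOR_REVIEW = "flag_for_review"


@dataclass(frozen=True)
class Candidate:
    code: str
    supporting_text: str   # passage from the note; empty if none was found
    support_score: float   # hypothetical 0..1 estimate of how well the text supports the code


def triage(candidate: Candidate, min_score: float = 0.8) -> Tuple[Outcome, str]:
    """Propose only when the code is grounded; otherwise surface the question to a human."""
    if not candidate.supporting_text.strip():
        return Outcome.FLAG_FOR_REVIEW, "no supporting passage found in the note"
    if candidate.support_score < min_score:
        return Outcome.FLAG_FOR_REVIEW, "documentation only weakly supports this code"
    return Outcome.PROPOSE, "supported by cited documentation"


if __name__ == "__main__":
    weak = Candidate("E11.9", "pt mentions sugar issues", 0.4)
    outcome, reason = triage(weak)
    print(outcome.value, "-", reason)
```

The design choice worth noticing is that there is no branch that returns a code without a reason attached: the only alternatives are a grounded proposal or a visible flag.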
Rules don’t negotiate
Language models are good at reading a clinical conversation and bad, on their own, at applying a rulebook the same way every single time. So we don’t ask one to. Between the model’s proposal and anything that reaches the record sit deterministic compliance checks: the same coding rules, applied the same way, on every encounter, without mood or drift.
The point of separating these is that a rule check should not be a matter of opinion. Bundling rules, coding edits, and documentation requirements are not things you want a probabilistic system improvising under pressure. They are things you want enforced. Keeping that layer deterministic is what lets the model do what it is genuinely good at, understanding the encounter, while the non-negotiable parts stay non-negotiable. (We don’t publish the internals of those checks, for the same reason you don’t publish the combination to a safe.)
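For readers who want a picture of what "deterministic" means here, the following is a generic sketch of a rule check written as a pure function. It is not one of our checks and reveals nothing about them: the rule table uses placeholder identifiers (`PROC_A`, `PROC_B`, `PROC_C`) rather than real coding edits, and the function name is hypothetical.

```python
from typing import FrozenSet, List, Set

# Placeholder pairs for illustration only; these are NOT real coding edits.
EXCLUSIVE_PAIRS: Set[FrozenSet[str]] = {
    frozenset({"PROC_A", "PROC_B"}),
    frozenset({"PROC_A", "PROC_C"}),
}


def check_exclusive_pairs(codes: List[str]) -> List[str]:
    """Return one violation message per placeholder pair that appears together.

    The function reads no clock, randomness, or model output, so the same
    set of codes always yields the same list of violations.
    """
    present = set(codes)
    violations = []
    for pair in EXCLUSIVE_PAIRS:
        if pair <= present:  # both members of the pair are on the encounter
            first, second = sorted(pair)
            violations.append(f"{first} and {second} cannot be reported together")
    return sorted(violations)  # sorted so output order never depends on set iteration


if __name__ == "__main__":
    print(check_exclusive_pairs(["PROC_B", "PROC_A", "OTHER"]))
```

Because the check is a plain lookup over a fixed table, the order of the input codes and the number of times it runs make no difference to the result, which is the property the section describes.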
The clinician is the final authority
Here is the line we will not blur: nothing is billed without the clinician’s review and sign-off. The platform proposes; the clinician decides. Every code, with its supporting evidence and any flags attached, is presented for a human to accept, change, or reject before it goes anywhere near a claim.
This is not a disclaimer bolted on for cover. It is the design. The assistant’s job is to do the legwork (read the full encounter, draft the coding, surface the supporting text and the open questions) so that the clinician’s review is fast and well-informed rather than a blank-page chore. The expertise that decides whether a code is right for this patient stays with the person who was in the room. Payers, in turn, adjudicate the claim; we don’t pretend to guarantee what they’ll pay.
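One way to picture "the platform proposes; the clinician decides" is as a gate that cannot be passed without a human decision on every code. The sketch below is a simplified illustration under assumed names (`Review`, `ReviewedCode`, `billable_codes`); it is not our actual workflow code.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Review(Enum):
    PENDING = "pending"
    ACCEPTED = "accepted"
    REJECTED = "rejected"


@dataclass
class ReviewedCode:
    code: str
    review: Review = Review.PENDING  # every proposal starts undecided


def billable_codes(codes: List[ReviewedCode], signed_off: bool) -> List[str]:
    """Return the codes that may go on a claim, or raise if the human step is incomplete."""
    if not signed_off:
        raise PermissionError("clinician sign-off is required before billing")
    pending = [c.code for c in codes if c.review is Review.PENDING]
    if pending:
        raise ValueError(f"codes still awaiting review: {pending}")
    # Rejected codes are dropped; only explicitly accepted codes pass the gate.
    return [c.code for c in codes if c.review is Review.ACCEPTED]


if __name__ == "__main__":
    encounter = [
        ReviewedCode("E11.9", Review.ACCEPTED),
        ReviewedCode("I10", Review.REJECTED),
    ]
    print(billable_codes(encounter, signed_off=True))
```

Note that the default state is `PENDING`, so a code nobody looked at blocks the claim instead of slipping through.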
What AI does well here, and what it doesn’t
Be clear-eyed about both. AI is genuinely good at reading a long, messy encounter and not missing the detail a tired human skims past at 7 p.m. It is good at consistency, at surfacing the documentation behind a code, at flagging the edits a person forgets to check. Used this way, it tends to make the record and the coding agree more often, not less.
What it should not do is decide alone. Clinical judgment about a specific patient, the call on a genuinely ambiguous case, final accountability for what gets submitted: those belong to a human, and a tool that claims otherwise is selling you risk dressed as convenience. We don’t guarantee coding accuracy or reimbursement, because no honest system can. What we can do is make the proposal transparent, the uncertainty visible, and the human’s job easier.
The honest takeaway
Trust in an AI coding assistant should not come from how confident it sounds. It should come from how it behaves when it is wrong or unsure: whether it grounds every code in your documentation, flags what it cannot support, runs the rules the same way every time, and hands the final decision to you. Guardrails over guesswork. That is the standard, and it is a standard you should hold every vendor to, us included.
If you want to see how that posture extends across the platform (clinician control, data handling, and our broader responsible-AI commitments), read how we approach security and clinician control.